Edited by 

Walid Dabbous 
and Christophe Diot 



SPRINGER-SCIENCE+BUSINESS MEDIA, B.V. 





Protocols for 
High-Speed 
Networks V 




IFIP - The International Federation for Information Processing 

IFIP was founded in 1960 under the auspices of UNESCO, following the First 
World Computer Congress held in Paris the previous year. An umbrella 
organization for societies working in information processing, IFIP’s aim is two-fold: 
to support information processing within its member countries and to encourage 
technology transfer to developing nations. As its mission statement clearly states, 

IFIP’s mission is to be the leading, truly international, apolitical organization 
which encourages and assists in the development, exploitation and application 
of information technology for the benefit of all people. 

IFIP is a non-profitmaking organization, run almost solely by 2500 volunteers. It 
operates through a number of technical committees, which organize events and 
publications. IFIP’s events range from an international congress to local seminars, 
but the most important are: 

• the IFIP World Computer Congress, held every second year; 

• open conferences; 

• working conferences. 

The flagship event is the IFIP World Computer Congress, at which both invited and 
contributed papers are presented. Contributed papers are rigorously refereed and 
the rejection rate is high. 

As with the Congress, participation in the open conferences is open to all and 
papers may be invited or submitted. Again, submitted papers are stringently 
refereed. 

The working conferences are structured differently. They are usually run by a 
working group and attendance is small and by invitation only. Their purpose is to 
create an atmosphere conducive to innovation and development. Refereeing is less 
rigorous and papers are subjected to extensive group discussion. 

Publications arising from IFIP events vary. The papers presented at the IFIP 
World Computer Congress and at open conferences are published as conference 
proceedings, while the results of the working conferences are often published as 
collections of selected and edited papers. 

Any national society whose primary activity is in information may apply to 
become a full member of IFIP, although full membership is restricted to one society 
per country. Full members are entitled to vote at the annual General Assembly. 
National societies preferring a less committed involvement may apply for associate 
or corresponding membership. Associate members enjoy the same benefits as full 
members, but without voting rights. Corresponding members are not represented in 
IFIP bodies. Affiliated membership is open to non-national societies, and individual 
and honorary membership schemes are also offered. 




Protocols for 
High-Speed 
Networks V 



TC6 WG6.1/6.4 Fifth International Workshop 
on Protocols for High-Speed Networks 
(PfHSN ‘96) 

28-30 October 1996, Sophia Antipolis, 
France 



Edited by 

Walid Dabbous 

and 

Christophe Diot 

INRIA Sophia Antipolis 
France 




Springer-Science+Business Media, B.V. 




First edition 1997 

© 1997 Springer Science+Business Media Dordrecht 
Originally published by Chapman & Hall in 1997 



ISBN 978-1-4757-6316-4 ISBN 978-0-387-34986-2 (eBook) 

DOI 10.1007/978-0-387-34986-2 

Apart from any fair dealing for the purposes of research or private study, or criticism or review, 
as permitted under the UK Copyright Designs and Patents Act, 1988, this publication may not 
be reproduced, stored, or transmitted, in any form or by any means, without the prior 
permission in writing of the publishers, or in the case of reprographic reproduction only in 
accordance with the terms of the licences issued by the Copyright Licensing Agency in the 
UK, or in accordance with the terms of licences issued by the appropriate Reproduction Rights 
Organization outside the UK. Enquiries concerning reproduction outside the terms stated here 
should be sent to the publishers at the London address printed on this page. 

The publisher makes no representation, express or implied, with regard to the accuracy of the 
information contained in this book and cannot accept any legal responsibility or liability for 
any errors or omissions that may be made. 



A catalogue record for this book is available from the British Library 



Printed on permanent acid-free text paper, manufactured in accordance with 
ANSI/NISO Z39.48-1992 and ANSI/NISO Z39.48-1984 (Permanence of Paper). 




CONTENTS 



Preface 

Committees 

PART ONE Transmission Control 

1 Estimating the available bandwidth for real-time traffic over best 
effort networks 
F. Davoli, O. Khan and P. Maryni 

2 A new algorithm for measurement-based admission control in integrated 
services packet networks 
C. Casetti, J. Kurose and D. Towsley 

3 Simulation analysis of TCP and XTP file transfers in ATM networks 
M.A. Marsan, M. Soldi, A. Bianco, R. Lo Cigno and M. Munafo 

PART TWO Video over ATM 

4 A picture quality control framework for MPEG video over ATM 
A. Mehaoua, R. Boutaba and G. Pujolle 

5 Is VBR a solution for an ATM LAN 
O. Bonaventure, E. Klovning and A. Danthine 

PART THREE Communication Systems Architecture 

6 A fast, flexible network interface framework 
W.S. Liao, S.-M. Tan and R.H. Campbell 

7 Multimedia partial order transport architecture: design and 
implementation 
M. Fournier, C. Chassot, A. Lozes and M. Diaz 

PART FOUR Group Communication 

8 The case for packet level FEC 
C. Huitema 

9 Fully reliable multicast in heterogeneous environments 
J.F. de Rezende, A. Mauthe, S. Fdida and D. Hutchinson 

10 Reliable multicast: where to use FEC 
J. Nonnenmacher and E. Biersack 

11 Performance evaluation of reliable multicast transport protocol 
for large-scale delivery 
T. Shiroshita, T. Sano, O. Takahashi, M. Yamashita, N. Yamanouchi 
and T. Kushida 

PART FIVE ILP 

12 Integrated layer processing can be hazardous to your performance 
B. Ahlgren, M. Bjorkman and P. Gunningberg 

13 Automated code generation for integrated layer processing 
T. Braun and C. Diot 

PART SIX QoS 

14 Implementation and evaluation of the QoS-A transport system 
A. Campbell and G. Coulson 

15 User-to-user QoS - management and monitoring 
M. Zitterbart 

Index of contributors 

Keyword index 




Preface 



We are happy to welcome you to the IFIP Protocols for High-Speed Networks 
'96 workshop hosted by INRIA Sophia Antipolis. This is the fifth event in a 
series initiated in Zurich in 1989 followed by Palo Alto (1990), Stockholm 
(1993), and Vancouver (1994). 

This workshop provides an international forum for the exchange of 
information on protocols for high-speed networks. The workshop focuses on 
problems related to the efficient transmission of multimedia application data 
using high-speed networks and internetworks. 

Protocols for High-Speed Networks is a "working conference". This is why 
we have favoured high-quality papers describing ongoing research and novel 
ideas. The number of selected papers was kept low in order to leave room for 
discussion of each paper. Together with the technical sessions, working 
sessions were organized on hot topics. 

We would like to thank all the authors for their interest. We also thank the 
Program Committee members for the level of effort in the reviewing process 
and in the workshop technical program organization. 

We finally thank INRIA and DRET for their financial support to the 
organization of the workshop. 

Abstract 

A mechanism for the estimation of the available bandwidth between two end-points of a best-effort 
network is presented. The estimate is obtained by a simple statistical analysis of the effect that the 
network has on a synchronous train of packets. The possibility of exploiting self-similar characteristics of 
the delay jitter is also discussed. 



Keywords: multimedia, load estimation, self-similarity. 



1 INTRODUCTION 

Distributed multimedia applications are characterized by the presence of a mix of heterogeneous traffic flows 
(e.g., video, voice, data) with different statistical features and performance requirements. In this context, a 
number of services may require a very large bandwidth and stringent Quality of Service (QoS) constraints. 
This does not constitute a problem in the presence of high-performance networks (e.g., ATM), where the 
availability of large bandwidth and the presence of a number of control mechanisms can guarantee a wide 
range of QoS. However, there are still several situations (which are likely to last in the medium term) 
where the presence of networks with either limited capacity or a “best-effort” service type may represent a 
potential bottleneck. 

As a matter of fact, the widespread diffusion of the Internet Protocol Suite (based on a best-effort 
service type) drives the development of multimedia applications which rely upon protocols with 
almost no QoS guarantees (e.g., UDP and IP), in both LAN and WAN frameworks. It is likely that such 
applications will continue to grow in the medium term and to receive a further boost from the increase 
in network capacity (e.g., IP over ATM). 

This scenario suggests that the multimedia environment will be characterized by a wide variety of 
traffic types and by the presence of networks with possibly very different capabilities. In particular, some 
traffic types, like voice and video, will require QoS guarantees (e.g., in terms of maximum delay, delay 
jitter and throughput), whereas some networks will offer an asynchronous service which would be unable 
to ensure the required QoS. Control procedures should therefore be envisaged (like control of admission, 
congestion and flow), capable of operating at the edges of the best-effort networks, in a way not too 
dissimilar from the analogous ones performed at a lower architectural level for the ATM network. 

In this paper we address a problem which is strictly related to the implementation of such control 
functions, namely the estimation of the “available bandwidth” between two end-points of a best-effort 
network. 

By available bandwidth we mean here a quantity which makes it possible to estimate whether a given 
telecommunication service can be set up between two end-points of a network. 

The paper is organized as follows. The next Section describes the system model. The statistical analysis 
which leads to the estimation mechanisms is described in Section 3. In Section 4 we discuss some issues 
related to a self-similar approach and in Section 5 we draw the conclusions. 



2 THE SYSTEM MODEL 

Consider two end points of a channel of a generic network, one of them acting as a transmitter and the 
other as a receiver. We may think of a channel as a link of a network or a path composed 
of more than one link. The transmitter generates and sends a short packet to the receiver once every T 
seconds. The receiver, upon reception of a packet, computes its interarrival time, which will be used for an 
estimate of the load of the channel. In such a framework, if the channel is lightly loaded, it is reasonable 
to expect that the receiver will also receive a synchronous (or quasi-synchronous) train of packets. On 
the other hand, as the load of the channel increases, we expect to see, at the receiver site, a degradation 
of this synchronicity. 

Figure 1 gives a pictorial representation of this simple model scheme. 

These measure packets do not need to carry information and, therefore, their size can be very small. 
Indeed, if T is chosen properly, the train of packets does not significantly alter the overall load of the 
channel. 
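
As a rough illustration of why the probe train can be made negligible, the following Python sketch computes the fraction of link capacity that the measurement packets consume. The 64-byte packet size is an assumption made only for this example (the text says only that the packets can be very small); T = 10 ms and the 1.5 Mbit/s line are the values used in the experiment of Section 3, and the function name is hypothetical.

```python
def probe_load_fraction(packet_bytes: int, period_s: float, link_bps: float) -> float:
    """Fraction of the link capacity used by one probe packet every period_s seconds."""
    probe_bps = packet_bytes * 8 / period_s  # bit rate of the probe train
    return probe_bps / link_bps


# Assumed 64-byte probe, sent every T = 10 ms over a 1.5 Mbit/s serial line.
fraction = probe_load_fraction(packet_bytes=64, period_s=0.010, link_bps=1.5e6)
print(f"probe train uses {fraction:.1%} of the link")
```

Under these assumptions the probes occupy a few percent of the slowest link; a larger T or a smaller packet reduces this further, at the cost of fewer samples per second.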

This way of estimating the load has several advantages. First, there is no need to poll the nodes of 
the channel (e.g., routers) in order to get load information: not everyone, in fact, can normally access 
routers and, when access rights are given, routers might have different access protocols (e.g., SNMP, 
CMIP). Second, this way of estimating requires almost no knowledge of the channel. In 
particular, we do not need to know the number and type of active connections on the channel in order to 
come up with an estimate. This estimating mechanism provides a way of directly measuring the part 
of the channel that is still available, i.e., its free part. 

Each valuable real-time service (e.g., video, voice) needs a certain degree of synchronicity as 
a main Quality of Service requirement. By mapping the degree of synchronicity of the train of packets at 
the receiver for each service, we are able to say whether such a service can fit into the channel or network. 

A further study, beyond the scope of this work, is to map a connection of a given type with a level of 
degradation of the synchronicity of the train of packets. With such a further mapping we can say how 
many connections of a given type could fit into a channel. 

It is worth noting that not every source-destination pair needs to generate the test train of packets 
(this procedure would be likely to generate an excessively high load in the presence of a large number of 
connections, even though the length of a test packet is of the order of a few bytes). One possible alternative 
is to have a pair of “multimedia servers” performing this operation and distributing the results of the 
analysis, either periodically (with a period much larger than T) or on demand. In this case, the additional 
load caused by the measurement packets could really be considered negligible. The presence of such 
“network elements” performing the estimation (in lieu of the sources themselves) would also have the 
advantage of by-passing possible source control mechanisms that might alter the statistics (e.g., traffic 
shaping or policing). 

Figure 1 The system model. 



3 THE ESTIMATING MECHANISM 

The aim of the Estimating Mechanism is to come up with a measure of the level of synchronicity of the 
train of packets at the receiving end (Bolot 1993). Figure 2 shows the interarrival times relative to an 
experiment over a “channel” composed of two Ethernet LANs linked, through routers, by a 1.5 Mbit/s 
serial line, with T = 10 ms. For the first 50 seconds of the experiment (i.e., 5000 packets), the channel is 
lightly loaded and then a heavy ftp data transfer takes place. The effect of the heavy data transfer on the 
train synchronicity can be easily seen: before the ftp takes place, data are gathered around the value of 
T; immediately after the ftp, the presence of bursts of packets is demonstrated by the quite large number 
of data close to zero. 

Let us denote by $X_k$ the time lag between the arrival instants of packets $k$ and $k-1$, and let us 
consider first the interarrival time average. More specifically, we focus on the average $\bar{X}_m(i)$ computed 
over a window of $m$ time lags. 

Figure 2 Interarrival times seen by the receiver. 



Figure 3 shows this value computed over a window of 10 time lags (i.e., m = 10). The effect of the ftp 
connection is evident: averages are much more spread around the value of T. 

Figure 3 “Sliding” averages (m = 10). 

The next step is to look at the standard deviation. In particular, we considered a “sliding” 
standard deviation estimator defined as follows: 

$$\sigma_m(i) = \sqrt{\frac{1}{m-1} \sum_{k=mi+1}^{m(i+1)} \left( X_k - \bar{X}_m(i) \right)^2}$$

where $\bar{X}_m(i)$ is the average of the time lags over the same window of $m$ lags. 
Figure 4 shows the standard deviations $\sigma_{10}(i)$, computed over a window of 10 time lags: the presence of 
the ftp connection is also evident in this figure. 
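
The windowed average and standard deviation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the windows are taken as consecutive, non-overlapping blocks of m lags (which is what the index arithmetic 17 · 30 · 10 later in this section suggests), and the function names and sample data are hypothetical.

```python
import math


def interarrival_lags(arrival_times):
    """Time lag X_k between the arrivals of packets k and k-1."""
    return [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]


def block_stats(lags, m):
    """Return one (average, standard deviation) pair per block of m lags.

    The standard deviation uses the 1/(m-1) normalisation.
    """
    stats = []
    for start in range(0, len(lags) - m + 1, m):
        window = lags[start:start + m]
        mean = sum(window) / m
        variance = sum((x - mean) ** 2 for x in window) / (m - 1)
        stats.append((mean, math.sqrt(variance)))
    return stats


# Synthetic data in milliseconds: a perfectly synchronous train (T = 10 ms)
# and a bursty one with the same average lag.
steady = interarrival_lags(list(range(0, 210, 10)))
bursty = [0, 20] * 5

print(block_stats(steady, 10))
print(block_stats(bursty, 10))
```

Both trains have the same windowed average, but only the bursty one has a non-zero standard deviation, which is the effect the figures in this section display.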




Figure 4 “Sliding” standard deviation (m = 10) . 

Unfortunately, time lags, along with sliding averages and standard deviations, only give a good “pictorial” 
feeling of the degree of asynchronicity. In other words, they do not give a stable value for a reliable 
measure of the degree of asynchronicity. To see this, let us consider another experiment. In this second 
experiment, monitoring is performed on the same channel but, this time, no particular load situation is 
imposed on it: measurements have been taken during a normal working period. During this experiment, 
9000 packets have been involved in the measurements. If we look at the values of $\bar{X}_m(i)$, even for different 
values of $m$, we realize that, even though the value of T is persistent, the plot is anything but stable. 
Figures 5(a,b) show this point quite clearly. 







Figure 5 Sliding averages for the second experiment, (a) m = 10; (b) m = 20. 

In order to get a stable value, we considered the $\epsilon$-percentile $\xi_{\epsilon,n}(i)$ computed on a window of $n$ sliding 
standard deviations, where $\xi_{\epsilon,n}(i)$ is such that, on the set $[\sigma_m(ni+1), \ldots, \sigma_m(n(i+1))]$, less than 
$100\epsilon$ % of the values are smaller than $\xi_{\epsilon,n}(i)$. This quantity has all the characteristics we are 
looking for: it is easy to compute, it gives a stable load estimate and it is also quite fast in reacting to 
varying load conditions. 

situation    $\xi_{0.3,10}$ 
empty        1 
nv           5 
ftp          12 

Table 1 Percentile measurements in different situations. 

Figure 6 shows the values of $\xi_{\epsilon,n}(i)$ for the first experiment. We set $\epsilon = 0.3$, $n = 30$, and $m = 10$ 
(these figures work well over a large number of experiments). Hence, each value corresponds to the 0.3- 
percentile computed on a set of 30 sliding standard deviations which, in turn, are computed over 10 time 
lags. Notice that the plot changes abruptly after $i = 17$: this value corresponds to the 5100-th time lag 
(i.e., $17 \cdot 30 \cdot 10$), which is approximately the time at which the ftp connection takes place. 
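
A minimal Python sketch of this percentile step follows. It is one possible reading of the definition, not necessarily the authors' exact procedure: the window of n standard deviations is sorted and the largest element is returned for which fewer than 100ε % of the values are strictly smaller. The function names are hypothetical.

```python
import math


def percentile_estimate(values, eps):
    """Largest element such that fewer than 100*eps % of the values are smaller."""
    ordered = sorted(values)
    k = max(math.ceil(eps * len(ordered)) - 1, 0)  # zero-based rank
    return ordered[k]


def load_estimates(stds, n, eps):
    """One percentile per consecutive block of n standard deviations."""
    return [percentile_estimate(stds[start:start + n], eps)
            for start in range(0, len(stds) - n + 1, n)]


# Toy input: 20 "standard deviations", blocks of n = 10, eps = 0.3.
estimates = load_estimates(list(range(1, 21)), n=10, eps=0.3)
print(estimates)

# Index arithmetic from the text: estimate i = 17 with n = 30 and m = 10
# corresponds to time lag 17 * 30 * 10.
print(17 * 30 * 10)
```

Because each estimate summarises n · m time lags, the reaction time of the estimator is roughly n · m · T seconds, which is 3 s for the parameter values quoted above.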




Figure 6 Plot of $\xi_{\epsilon,n}(i)$, m = 10, n = 30, $\epsilon$ = 0.3. 

Table 1 shows measured values of $\xi_{0.3,10}$ in three different situations (these situations were tried 
with no other users on the channel). The first one is the situation with no transmission at all on the 
channel; the second one is the popular “nv” application, which is a tool for sending a video signal to 
a remote workstation with different levels of quality; the third one is a plain “ftp” connection along the 
channel. 

These measurements can be used to estimate the ability of the network to satisfy the QoS for 
a given application. For example, we noticed that an “nv” connection cannot be run on that channel at its 
best quality (in terms of frame rate and/or window dimension) if an “ftp” is running at the same time. 
Hence, we know that if the measured value of $\xi_{0.3,10}$ is greater than 12 then no “nv” at its best quality 
is possible. We fixed three different levels of quality: a best one, a medium one and a sufficient one. In 
this framework, the terms “best”, “medium” and “sufficient” have to be understood as Perceived Quality of 
Service (PQoS) (Seiz 1994, Ferrari 1990). Table 2 shows the requirements for these three levels of quality. 

level        $\xi_{0.3,10}$ 
best         < 12 
medium       < 20 
sufficient   < 25 

Table 2 Grades needed for the three different levels of quality. 
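
The thresholds of Table 2 translate directly into a lookup. The Python sketch below is illustrative only; the function name is hypothetical, and the thresholds are those measured for “nv” on the specific channel of the experiment, so they should not be read as general constants.

```python
# (level, upper bound on the percentile estimate), taken from Table 2.
NV_THRESHOLDS = [("best", 12), ("medium", 20), ("sufficient", 25)]


def best_supported_level(estimate):
    """Highest "nv" quality level whose bound the estimate satisfies, or None."""
    for level, bound in NV_THRESHOLDS:
        if estimate < bound:
            return level
    return None


for value in (5, 12, 22, 30):
    print(value, best_supported_level(value))
```

An estimate of 12, the value measured with an “ftp” running, falls just outside the “best” grade, in line with the observation made in the text.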



4 HOW ABOUT SELF-SIMILARITY ? 

Figure 7(a) shows the values of interarrival times for the second experiment we presented in this paper (the 
one Figure 5 refers to). The plots (b), (c), and (d) represent portions of plot (a) shown over different time 
scales. This suggests that a synchronous train of packets, when transferred over an asynchronous channel, 
reaches the receiver with interarrival times which appear to have self-similar characteristics. Many 
examples can be found in the literature which discuss self-similarity both in general terms (see for example 
Mandelbrot 1983) and within telecommunication systems (see for example Leland 1994, Erramilli 1993, 
Erramilli and Willinger 1993, Erramilli 1996). 

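Another possible direction is, therefore, to exploit self-similarity. We considered the so-called R/S 
analysis because of its nice property, among others, of being simple and straightforward. Given a collection 
of samples $X_i$, $i = 1, \ldots, N$, let us define $RS(d) = R(d)/S(d)$, $d < N$, where: 

$$R(d) = \max_{1 \le k \le d} \sum_{i=1}^{k} \left( X_i - \bar{X}(d) \right) - \min_{1 \le k \le d} \sum_{i=1}^{k} \left( X_i - \bar{X}(d) \right) \qquad (1)$$

$$S(d) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} \left( X_i - \bar{X}(d) \right)^2} \qquad (2)$$

and 

$$\bar{X}(d) = \frac{1}{d} \sum_{i=1}^{d} X_i \qquad (3)$$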



R/S analysis is generally used to test for self-similarity of a set of collected data. The variable $d$ 
can be interpreted as a time lag and the function R/S is able to capture persistence on different time 
scales, if any. Following this analysis, if data are characterized by self-similarity, then the function $RS(d)$ 
should show, on a log-log scale, an asymptotic slope (the so-called “Hurst Effect”) which approximates $H$ 
($0.5 < H < 1$). $H$ is called the “Hurst parameter” (Leland 1994). The R/S analysis of the second experiment 
is shown in Figure 8. The figure also shows that, for those data, the Hurst parameter $H$ is approximately 0.7. 

Figure 7 Time lags over different time scales: (a) 9 s; (b) 5 s; (c) 1 s; (d) 0.5 s. 
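
The R/S statistic for a single lag d can be sketched as follows in Python. This assumes the standard rescaled adjusted range form (range of the partial sums of deviations from the mean, divided by the standard deviation with a 1/d normalisation); the function name is hypothetical, and the sketch does not guard against a constant input, for which S(d) is zero.

```python
import math


def rescaled_range(samples, d):
    """R(d)/S(d) computed on the first d samples."""
    x = samples[:d]
    mean = sum(x) / d

    # Partial sums of the deviations from the mean, for k = 1..d.
    partial_sums = []
    running = 0.0
    for value in x:
        running += value - mean
        partial_sums.append(running)

    r = max(partial_sums) - min(partial_sums)            # range R(d)
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / d)   # standard deviation S(d)
    return r / s


rs = rescaled_range([1.0, 2.0, 3.0, 4.0], d=4)
print(rs)
```

In practice the statistic is evaluated for many values of d and log RS(d) is plotted against log d; the slope of the resulting points for large d is taken as the estimate of H.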

Another powerful analysis is the so-called “$\mathrm{Var}(X^{(m)})$” one. This considers the variance of the aggregated series 
$X^{(m)}(k) = \frac{1}{m} \left( X_{km-m+1} + \cdots + X_{km} \right)$, with $k \geq 1$. A long-range dependence, or heavy tail, can be 
detected by plotting $\mathrm{Var}(X^{(m)})$ versus $m$ on a log-log scale and looking for an asymptotic behaviour $m^{-\beta}$ 
(i.e., a slope of $-\beta$) with $0 < \beta < 1$. It can be shown that $H = 1 - \beta/2$, where $H$ is the Hurst parameter 
of the R/S analysis. 
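
The variance-time analysis can be sketched in Python as below. The block averaging and the least-squares fit on the log-log points are standard, but the function names are hypothetical and the synthetic input is chosen only so that the fitted slope is known in advance.

```python
import math


def aggregated_variance(samples, m):
    """Variance of the series obtained by averaging non-overlapping blocks of m samples."""
    blocks = [sum(samples[start:start + m]) / m
              for start in range(0, len(samples) - m + 1, m)]
    mean = sum(blocks) / len(blocks)
    return sum((b - mean) ** 2 for b in blocks) / len(blocks)


def loglog_slope(ms, variances):
    """Least-squares slope of log(variance) against log(m)."""
    xs = [math.log(m) for m in ms]
    ys = [math.log(v) for v in variances]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator


def hurst_from_beta(beta):
    """H = 1 - beta/2, as stated in the text."""
    return 1 - beta / 2


# Synthetic check: variances that decay exactly as m^(-0.75).
ms = [1, 2, 4, 8, 16]
slope = loglog_slope(ms, [m ** -0.75 for m in ms])
print(slope, hurst_from_beta(-slope))
```

With measured data, `aggregated_variance` would supply the variances for each m, and only the points at large m should be used for the fit, since the slope is an asymptotic property.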

Figure 9 shows the plot of $\log \mathrm{Var}(X^{(m)})$ versus $\log m$ in the three cases (i.e., empty, nv and ftp). As 
regards the asymptotic slope, $-\beta$ is about $-0.75$. Notice that the parameter $\beta$ is sensitive to the load 
conditions. This suggests that self-similar techniques could also be used for load estimates. Unfortunately, 
self-similar parameters usually represent asymptotic behaviours. In other words, when it comes to exploiting 
self-similar analysis, we have to remember that many time scales are involved and that, therefore, we 
need to analyze a large number of samples. Hence, these techniques are scarcely applicable in a real-time 
framework, where the estimating mechanism must react quickly to load variations. 

Figure 8 Hurst effect with R/S plot. 




Figure 9 The $\mathrm{Var}(X^{(m)})$ analysis. 



5 CONCLUSIONS 

A mechanism for estimating the available bandwidth between two end points of a best-effort network 
has been presented. The purpose of this mechanism is to carry out the estimation procedure with no 
information about the overall offered load. The focus has been primarily on real-time services, even though 
non real-time ones often have maximum delay requirements as well. The mechanism works by performing, at one 
side of the network, a set of statistical operations on the measured delay of a train of packets synchronously 
transmitted by the other end point. 

As regards the possible applications of the scheme, we can mention two cases. One possibility is to use 
the information on the available bandwidth to control the rate of real-time sources (e.g., video), e.g., by 
decreasing it equally whenever a threshold is exceeded. A second one is in admission control; in this case, 
however, one should keep in mind that there is a certain amount of uncontrollable “background traffic” 
(e.g., ftp), which might produce significant changes after a connection admission/refusal decision has 
been taken. The two mechanisms might be employed jointly. This will be the subject of future work. 

A New Algorithm for Measurement-based Admission Control in Integrated Services Packet 
Networks* 



Claudio Casetti 

Abstract 

The purpose of call admission control in Integrated Services Networks is to offer a guarantee 
that Quality of Service (QoS) bounds are not violated due to the admission of new calls into 
the network. This is typically accomplished using the declared worst-case traffic descriptors for 
incoming calls, a solution which results in poor bandwidth utilization. A measurement-based 
admission scheme is an appealing alternative: not only does it offer adaptivity to changing traffic 
conditions, it also allows statistical multiplexing gains to be exploited. In this paper, we examine 
the problem of determining which traffic characterization a measurement-based admission control 
algorithm should require of sources requesting access. Building on the work of Jamin et al. (1995), 
we also propose an adaptive measurement-based admission control algorithm that simplifies the 
estimation process, and show that it can achieve a high level of utilization without violating its 
delay-based QoS guarantees. 



*This work was supported in part by the National Science Foundation under grants NCR-95-08274 and CDA- 
9502639 

1 INTRODUCTION 

The increasing bandwidths offered by today’s high-speed switching and transmission technologies 
have enabled a new generation of real-time applications that require Quality of Service (QoS) 
guarantees from the network. As a result, numerous researchers - Braden et al. (1994), Clark et 
al. (1992), Ferrari et al. (1994), Floyd and Jacobson (1995), Hyman et al. (1991), Kurose (1993), 
Towsley (1993) and Shenker (1995a) - have advocated extending the Internet service model 
to accommodate the varying bandwidth, delay bound, and loss tolerance requirements of these 
applications. 

One widely discussed network service model is one which provides “hard” guarantees to an 
application. In this case, an application is guaranteed a specified bandwidth, maximum end- 
to-end delay and no packet loss. While such a guaranteed service is appropriate for CBR-like 
applications with strict loss and delay requirements, it may not be appropriate for variable bit 
rate, adaptive, real-time applications, described in Clark et al. (1992), that can both tolerate 
and adapt to certain amounts of loss and/or end-to-end delay. Thus, additional service models 
that provide “softer” or statistical (rather than hard) guarantees have been proposed in Braden 
et al. (1994), Clark et al. (1992), Jamin et al. (1994.) and Wrockawski (1995). These alternate 
service models seek to provide such guarantees while still exploiting the advantage of statistical 
multiplexing at routers. Several recent IETF Internet Drafts, such as Breslau and Shenker (1995), 
Shenker (1995b), Shenker and Partridge (1995), Shenker, Partridge, Davie and Breslau (1995), 
lay the foundation for an Integrated Services Internet architecture based on such service models. 

Of particular interest to us in this paper is the proposed Integrated Services Internet service 
class known as predictive service - see Braden et al. (1994) and Jamin et al. (1995). With 
predictive service, as well as with other service classes, such as the so-called “controlled-load 
service class” defined in Wroclawski (1995), measurements of ongoing admitted calls are used 
together with the declared traffic parameters of a call seeking admission to the network in making 
the call acceptance/rejection decision. The goal of this admission control process is to admit a 
call only if there are sufficient resources to satisfy the QoS requirements of the new call, while at 
the same time not violating QoS guarantees made to existing, already-admitted calls. 

In this paper, we propose and evaluate a new measurement-based admission control algorithm. 
Our work extends existing work on measurement-based admission control, namely Clark 
et al. (1992) and Jamin et al. (1995), in three important directions: 

• We examine the question of how best (from an admission control standpoint) to characterize 
a source among several “equivalent” leaky bucket source characterizations. We find that 
among equivalent characterizations, an admission control algorithm that uses a characterization 
with a smaller token bucket size and a higher token rate can often achieve an overall 
higher utilization than with other equivalent characterizations, without significantly increasing 
the percentage of packets experiencing a delay violation. 

• We investigate which performance metrics should be measured. We show that measuring rate 
rather than delay provides us with more stable estimates and results in a simpler admission 
scheme. 

• We consider how to adaptively adjust the length of the window of time over which measurements 
are made. This obviates the need to choose, a priori, a length for a fixed measurement 
window. We show that our algorithm achieves a high utilization while being able to keep its 
delay commitments. Our study is chiefly conducted through simulation. 
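
To make the notion of “equivalent” token bucket characterizations in the first item above more concrete, the following Python sketch checks whether a packet trace conforms to a given (token rate, bucket depth) pair. It is a generic token bucket check written for illustration, not the admission algorithm of this paper; the function name, the packetized (rather than fluid) traffic and the sample trace are all assumptions.

```python
def conforms(arrivals, rate, depth):
    """True if every packet finds enough tokens in a bucket of the given rate and depth.

    arrivals: list of (time_s, size_bits) pairs sorted by time.
    The bucket starts full and is refilled at `rate` bits per second.
    """
    tokens = depth
    last_time = 0.0
    for time_s, size_bits in arrivals:
        tokens = min(depth, tokens + rate * (time_s - last_time))
        last_time = time_s
        if size_bits > tokens:
            return False
        tokens -= size_bits
    return True


# Hypothetical trace: one 1000-bit packet every 0.5 s.
trace = [(0.0, 1000), (0.5, 1000), (1.0, 1000)]

print(conforms(trace, rate=2000, depth=1000))  # higher rate, smaller bucket
print(conforms(trace, rate=1000, depth=2000))  # lower rate, larger bucket
print(conforms(trace, rate=1000, depth=1000))  # neither enough rate nor depth
```

The same trace satisfies two different (rate, depth) pairs, which is the sense in which several characterizations of one source can coexist; which of them an admission controller should be given is the question the first item raises.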

The remainder of this paper is organized as follows. In section 2, we provide a brief overview 
of past work in the area of measurement-based admission control, and summarize the specific 
algorithm proposed in Jamin et al. (1995), tailoring the description to the simple network setting 
we consider. Section 3 addresses the issues of source characterization and suggests that measuring 
sources’ rates, rather than their packet delays, may be a better choice for a measurement-based 
admission control process. Our adaptive, window-based measurement and admission control 
algorithm is detailed in section 4. Section 5 presents simulation results of this algorithm under 
different traffic scenarios. Section 6 concludes the paper. 



2 PREVIOUS WORK ON MEASUREMENT-BASED ADMISSION 
CONTROL 

While the problem of admission control to a shared resource has been addressed by many 
researchers in recent years, the idea of using on-line measurements in the admission control process 
has been studied in only a few selected works. 

Tedijanto and Gün (1993) propose an admission control mechanism based on the on-line 
estimation of the equivalent bandwidth of an ATM connection using a dynamically adjusted token 
bucket. Also in an ATM framework, but considering a superposition of sources rather than a single 
connection, the work by Dziong et al. (1995) employs a Kalman filter to estimate the aggregate 
effective bandwidth for admission control purposes. Vin et al. (1994) present an observation-based 
mechanism to estimate the retrieval time of a media block from a multimedia server, using 
such information to grant access to clients which requested predictive service. Measurement-based 
admission control for predictive service in Integrated Services Packet Networks was first 
introduced in Clark et al. (1992) and then further developed in Jamin et al. (1995), where a delay 
and rate measurement scheme for predictive service was proposed. As our work builds on Jamin 
et al. (1995), we next describe the admission scheme proposed in that paper. The interested 
reader is referred to Jamin et al. (1995) for further details. 

Since our extensions will be considered in the context of a simpler network model than the 
one considered in Jamin et al. (1995), we first describe our network model. 

2.1 Network Model 



We consider the network scenario shown in Fig. 1, consisting of a link of capacity $\mu$ connecting 
two switches. Each switch is assumed to have an infinite buffer, so that packet loss probability 
is not a factor in our study. This allows us to focus on other QoS metrics such as delay and 
throughput. 

Sources connect to the first switch through links of infinite bandwidth. The first switch concentrates 
(multiplexes) packets destined to the downstream switch. All traffic from the sources belongs 
to a single predictive traffic class. While there is no guaranteed-service traffic in our model, 
we note that other traffic classes can be handled using such techniques as Weighted Fair Queueing - 
see Clark et al. (1992), Demers et al. (1989), Parekh and Gallager (1993) and Golestani (1994) 
- or class-based queueing as described in Floyd and Jacobson (1995). 

Figure 1 Network Scenario 

2.2 Source Model 

New calls arrive to the leftmost switch in Fig. 1 according to a Poisson process with rate $\nu_s$ 
calls per second. Call holding times are independent and identically distributed exponential 
random variables with mean $T_s$. The traffic generated by an accepted call is modeled by an On/Off 
Markov-modulated fluid process. While in the On state, a source generates traffic at peak rate r; 
during the Off period no traffic is generated. In our simulations, unless otherwise noted, we use 
$\mu$ = 10 Mb/s, $\nu_s$ = 10 calls/s, and $T_s$ = 135 s. 

On and Off periods are exponentially distributed with mean $1/\alpha$ = 0.3125 s and 
$1/\beta$ = 0.3125 s respectively. Depending on the scenario we are considering, the peak rates of 
admitted sources are either fixed (we refer to these as homogeneous sources) or exponentially 
distributed with mean r = 64 kb/s (heterogeneous sources). Our motivation for choosing fluid 
sources, rather than a discrete, packetized source model, will be explained in section 3. We note 
that with the choice of the parameters above, a network without any admission control mechanism 
would achieve a near 100 % utilization. 
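
As a rough check of the last remark, the sketch below computes the mean load offered to the link when every call is accepted. It is our own illustration, not part of the original simulator. It assumes the parameters read as a 10 Mb/s link, 10 calls/s, a 135 s mean holding time and a fixed 64 kb/s peak rate, and it treats the calls in progress as an M/M/infinity system. All names are hypothetical.

```python
# Back-of-the-envelope offered load for the source model of Section 2.2.
# Names are illustrative; the values are our reading of the parameters in the text.

LINK_CAPACITY_BPS = 10e6      # link capacity, 10 Mb/s
CALL_ARRIVAL_RATE = 10.0      # Poisson call arrivals, calls per second
MEAN_HOLDING_TIME = 135.0     # mean call holding time, seconds
PEAK_RATE_BPS = 64e3          # peak rate of a homogeneous source
MEAN_ON = 0.3125              # mean On period, seconds
MEAN_OFF = 0.3125             # mean Off period, seconds


def offered_load_bps():
    """Mean aggregate rate if every arriving call were accepted."""
    # Without admission control the calls in progress form an M/M/infinity
    # system, so their mean number is (arrival rate) x (mean holding time).
    mean_calls = CALL_ARRIVAL_RATE * MEAN_HOLDING_TIME
    # An On/Off source transmits at its peak rate only while it is On.
    on_fraction = MEAN_ON / (MEAN_ON + MEAN_OFF)
    return mean_calls * on_fraction * PEAK_RATE_BPS


load = offered_load_bps()
print(f"offered load: {load / 1e6:.1f} Mb/s "
      f"({load / LINK_CAPACITY_BPS:.2f} x link capacity)")
```

Under these assumptions the offered load is several times the link capacity, which is why the link would be almost permanently busy without admission control.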



2.3 The Admission Algorithm 

We now describe the measurement-based admission algorithm proposed in Jamin et al. (1995), 
in the context of the simple network scenario described above. The key to the admission process 
is that the new call (or “flow” of data) should be accepted only if: 

• the current traffic conditions at the switch can ‘guarantee’ that the call receives the QoS it 
requests. In the following we will precisely define what we mean by ‘guarantee’. 

• the acceptance of the new call does not cause QoS guarantees previously made to admitted 
flows to be violated. 

To achieve these goals, a call seeking admission is required to provide a traffic characterization, 
known as a TSpec - see Shenker and Partridge (1995) and Wroclawski (1995). 

Specifically, the call characterizes its traffic with two token bucket parameters: $\rho$ (a token 
generation rate) and $\sigma$ (a bucket depth) such that over a period of time T, the traffic 
generated by the call will not exceed $\rho T + \sigma$. We will address the problem of choosing 
appropriate values of $\rho$ and $\sigma$ in section 3. 
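
To make the token bucket constraint concrete, here is a minimal sketch that checks whether a sequence of discrete arrivals stays within the envelope defined by a token rate and a bucket depth. The function name `conforms` and the representation of traffic as (time, amount) pairs are our own assumptions. The paper itself works with fluid sources.

```python
def conforms(arrivals, rho, sigma):
    """Return True if the arrivals respect a (rho, sigma) token bucket.

    arrivals: list of (time_s, amount) pairs, sorted by time.
    rho:      token generation rate (amount per second).
    sigma:    bucket depth (same unit as amount).
    """
    tokens = sigma          # assumption: the bucket starts full
    last_time = None
    for time_s, amount in arrivals:
        if last_time is not None:
            # Tokens accumulate at rate rho, capped by the bucket depth.
            tokens = min(sigma, tokens + rho * (time_s - last_time))
        last_time = time_s
        if amount > tokens + 1e-9:
            return False    # more data than tokens: envelope violated
        tokens -= amount
    return True


# A 1 kb bucket refilled at 64 kb/s needs 1/64 s to refill completely.
print(conforms([(0.0, 1.0), (0.015625, 1.0)], rho=64.0, sigma=1.0))
print(conforms([(0.0, 1.0), (0.001, 1.0)], rho=64.0, sigma=1.0))
```

The first sequence waits long enough for the bucket to refill and conforms. The second sends its second unit too early and does not.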

The measured system quantities are: 

• $\hat{D}$, the delay experienced by traffic in the switch queue; 

• $\hat{\nu}$, the aggregate traffic rate at the switch. 

As we will see, as a result of the measurement process, at any given time the call admission 
algorithm will have an estimate of these two quantities, $\hat{D}$ and $\hat{\nu}$. Given these 
estimates, an arriving call is granted admission if the following two conditions hold - see 
Jamin et al. (1995): 

1. Given a current delay estimate, $\hat{D}$, a new delay estimate, $\hat{D}'$, is formed such that 
$\hat{D}' = \hat{D} + \sigma/\mu$. Note that $\sigma/\mu$ is the amount of time to transmit a full 
token bucket’s worth of data from the new call. This new delay estimate must satisfy: 

$$D_{bound} > \hat{D}' = \hat{D} + \frac{\sigma}{\mu} \qquad (1)$$ 

2. If the upper-bound rate declared by the new call, $\rho$, is added to the current estimate of the 
aggregate traffic rate, $\hat{\nu}$, the new estimate of the rate, $\hat{\nu}'$, satisfies: 

$$v\mu > \hat{\nu}' = \hat{\nu} + \rho \qquad (2)$$ 

where $v$ is referred to as the utilization target and $v\mu$ is the resulting rate bound. 

If the new call is admitted, the new estimates for delay and rate are then taken to be $\hat{D}'$ 
and $\hat{\nu}'$. We now discuss how these values are subsequently updated. 
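
The two admission conditions can be sketched in a few lines. This is an illustration under our reading of conditions (1) and (2): the delay estimate grows by the time needed to drain one full bucket, and the rate estimate grows by the declared token rate. The names `Estimates` and `admit` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimates:
    delay: float   # current delay estimate, seconds
    rate: float    # current aggregate rate estimate, bit/s


def admit(est, rho, sigma, mu, v, delay_bound):
    """Apply the two admission tests to a call declaring (rho, sigma).

    mu is the link capacity (bit/s), v the utilization target (0..1) and
    delay_bound the delay bound (seconds). Returns (accepted, estimates).
    """
    new_delay = est.delay + sigma / mu   # condition (1): add bucket drain time
    new_rate = est.rate + rho            # condition (2): add declared rate
    if delay_bound > new_delay and v * mu > new_rate:
        return True, Estimates(new_delay, new_rate)
    return False, est                    # estimates unchanged on rejection


# Example: 10 Mb/s link, 95 % utilization target, 16 ms delay bound.
state = Estimates(delay=0.002, rate=9.4e6)
for _ in range(2):
    ok, state = admit(state, rho=64e3, sigma=1e3, mu=10e6, v=0.95,
                      delay_bound=0.016)
    print(ok, state)
```

In this example the first call fits under the 9.5 Mb/s rate bound and the second does not, so the estimates are left untouched after the rejection.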

2.4 The estimation process 

In the following, we will find it useful to distinguish between “estimates” and “measurements” 
of delay or rate. “Measurements” are the observed delay values made over time, and “estimates” 
are constructed using these measurements, as well as the declared traffic parameters of arriving 
calls. 

Delay measurements are collected for a fixed period of time known as the “measurement 
window.” In the simplest case, at the end of a measurement window, an estimate is updated to 
be the highest observed value of that measure during the measurement window. In certain cases, 
however, the measurement window will be terminated prematurely and the estimates updated 
in a different manner. Specifically, as in Jamin et al. (1995): 

• When a new call is admitted, the delay and rate estimates are updated using equations (1) 
and (2). It is worth noting that these estimates are not updated when a call terminates. 

• When a single measurement exceeds the current estimate, the estimate is updated to be A 
times the sampled value. The factor A allows us to be more conservative by overestimating 
the quantity, resulting in more conservative admission control when the measure is close to 
the bound. In Jamin et al. (1995) it is suggested that A = 2 be used for delay and A = 1 for 
rate estimates. We will use these values. 

In our fluid-flow simulation model, a single, ‘instantaneous’ delay measurement is computed by 
dividing the switch queue length at that time by the queue’s service rate $\mu$. Since there is no 
such thing as an ‘instantaneous’ rate measure, a small interval S is specified and a rate 
measurement is taken as the number of bits transmitted on the link during this interval. 
Jamin et al. (1995) suggest using S = 500 ms. 
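
The update rules of this section can be summarised in a short sketch. The class below is our own illustration, not code from Jamin et al. (1995). The names `Estimator`, `on_measurement` and `on_admission` are hypothetical, and restarting the window after every premature update is an assumption drawn from the description above.

```python
class Estimator:
    """Tracks one estimate (delay or rate) over a measurement window."""

    def __init__(self, window, safety_factor, initial=0.0):
        self.window = window                # measurement window length, s
        self.safety_factor = safety_factor  # the factor A (2 for delay, 1 for rate)
        self.estimate = initial
        self._window_start = 0.0
        self._window_max = 0.0

    def _restart(self, now):
        self._window_start = now
        self._window_max = 0.0

    def on_measurement(self, now, sample):
        self._window_max = max(self._window_max, sample)
        if sample > self.estimate:
            # A sample above the estimate ends the window early and the
            # estimate is set to A times the sampled value.
            self.estimate = self.safety_factor * sample
            self._restart(now)
        elif now - self._window_start >= self.window:
            # Normal case: the estimate becomes the largest value seen
            # during the window that has just ended.
            self.estimate = self._window_max
            self._restart(now)

    def on_admission(self, now, new_estimate):
        # On admission the estimate is replaced by the value computed in the
        # admission test. Call departures deliberately trigger no update.
        self.estimate = new_estimate
        self._restart(now)


delay = Estimator(window=10.0, safety_factor=2.0)
delay.on_measurement(1.0, 0.004)    # exceeds the estimate: 2 x 0.004
delay.on_measurement(5.0, 0.003)    # below the estimate, window still open
delay.on_measurement(11.0, 0.002)   # window ends: highest sample was 0.003
print(delay.estimate)
```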

It should be pointed out that the quality of the call admission scheme in Jamin et al. (1995) 
is highly dependent on the values of the so-called ‘performance tuning knobs’ (target utilization, 
A, length of the measurement window). Selection of these values is an open problem which 
goes beyond the simple task of fine tuning the behavior of the admission scheme. Arguably, 
the most challenging issue is finding the ‘right’ length of the measurement window, since the 
performance of the admission scheme heavily relies on it. As noted in Jamin et al. (1995), too 
short a window lowers the admission threshold as soon as the traffic thins out, leaving the 
network dangerously exposed to bursts. On the other hand, long measurement windows lead 
to an excessively conservative scheme by delaying the time when flows’ TSpecs (that give a 
worst case traffic characterization) are replaced with on-line measurements (that reflect that 
call’s actually generated traffic). Perhaps the most important factor is that without an a priori 
knowledge of the traffic, it is difficult to determine an appropriate length of the measurement 
window. For these reasons, a measurement window which adapts to network traffic is desirable. 
We present such an adaptive algorithm in section 4. First we consider the problem of source 
characterization. 



3 SOURCE CHARACTERIZATION AND MEASURED PARAMETERS 

3.1 How should we characterize sources? 

The admission control process discussed above uses token bucket parameters as a source traffic 
characterization. 

In our simulations, we will model sources as Markov-modulated fluid sources whose smallest 
unit is the bit. By virtue of this choice, we are not committed to any packet size. However, tokens 
are representative of whole data units like cells and packets. We assume that a token is equivalent 
to a 1 kb data unit, as in Jamin et al. (1995). 

Following Elwalid and Mitra (1991), we consider bufferless leaky-bucket-constrained 
Markov-modulated On/Off fluid sources with parameters $(\sigma, \rho)$. But what values of 
$(\sigma, \rho)$ should be chosen? In Jamin et al. (1995), it is suggested that $(\sigma, \rho)$ 
pairs be chosen empirically as the smallest $\sigma$ and $\rho$ that result in the desired loss 
rate at the token bucket. Of course, such a strategy cannot be employed if the traffic 
characteristics of the On/Off sources have not been evaluated in advance. Using equations (2.18), 
(2.21) and (8.8) in Elwalid and Mitra (1991), we can derive the following expression relating the 
bucket depth, $\sigma$, the token rate, $\rho$, the source peak rate, r, and the source’s fluid 
loss probability, $\pi$, at the leaky bucket: 



$$\sigma = -(r - \rho)\,[\ldots] \qquad (3)$$ 



For a given bucket size, eq. (3) can be used to obtain the theoretical value of the token rate 
which results in a target loss probability of $\pi$ at a bufferless token bucket. We chose 
$\pi = 10^{-[\ldots]}$. We will refer to $(\sigma, \rho)$ pairs with the same target loss 
probability as equivalent. 

Given the set of equivalent source characterizations satisfying eq. (3), which source character- 
ization should we choose? Ideally, we would like to choose a characterization that, when coupled 
with the admission control process described in section 2, results in high link utilization at the 
switch without excessive QoS violations. In order to explore the impact of using different “equiv- 
alent” source characterizations, we ran simulations using the call admission algorithm and the 
homogeneous source model described in section 2. Delay and rate estimates were computed using 
the procedure outlined above, with a delay bound of 16 ms and a utilization target of 95 % (which 
yields a rate bound of 9.5 Mb/s). 

Tables 1 and 2 summarize the results we obtained for measurement window lengths of 10 and 
1 seconds, respectively. Simulations were run for a total of 10000 seconds of simulated time 
each, which is at least 3 orders of magnitude longer than the measurement window dynamics. 
The computed utilization values were such that the width of the 99.8% confidence interval was 
within 1% of the point estimate; the percentage of late bits showed significantly more variability. 
Each row in a table shows, for an equivalent $(\sigma, \rho)$ pair: 



• the resulting utilization of the switch output link; 

• the percentage of bits that violated the delay bound; 

• the maximum delay in ms observed during the simulation (i.e. the maximum queue length 
divided by the service rate); 

• the percentage of flows, among those rejected, that were rejected due to delay constraint 
violations (see eq. (1)); 

• the percentage of flows, among those rejected, that were rejected due to target utilization 
constraint violations (see eq. (2)). 



Thus, for example, the first row of Table 2 indicates that among those flows rejected, 3.74 % 
violated the delay constraint and 99.76 % violated the target rate constraint. Note that the 
percentages do not add up to 100 %, since there was a portion of flows that violated both 
bounds. 

$\sigma$   $\rho$   % Utilization   % Late Bits   Max. Obs. Delay [ms]   % Delay Rej.   % Rate Rej. 
1          64       74.53           0.0           0.334                  0.0            100.00 
4          62.8     74.46           0.0           0.0                    0.73           99.27 
10         61.1     55.35           0.0           0.0                    98.28          1.72 
20         58.8     32.24           0.0           0.0                    100.00         0.00 

Table 1 Equivalent token buckets with measurement window = 10 s 

$\sigma$   $\rho$   % Utilization   % Late Bits   Max. Obs. Delay [ms]   % Delay Rej.   % Rate Rej. 
1          64       91.30           0.37          65.32                  3.74           99.76 
4          62.8     91.25           0.35          41.55                  4.12           99.67 
10         61.1     91.25           0.29          39.40                  6.53           98.54 
20         58.8     90.70           0.18          43.15                  23.60          86.07 

Table 2 Equivalent token buckets with measurement window = 1 s 

Several observations are in order. Comparing Tables 1 and 2 indicates that a longer measurement 
window results in more conservative admission control: with a 10 second window, no 
late bits were observed, while with a smaller window (1 s), there was a small amount of delay 
violation. However, the achieved link utilization is much higher in the case of the small window. 

Interesting observations can also be made regarding the different bucket sizes used in the source 
characterization. The larger the bucket declared by the source, the more likely it is that a flow 
requesting admission is denied access on the grounds that it might violate the delay bound. This 
is because a larger $\sigma$ in (1) provides a higher delay estimate. However, with a large 
$\sigma$, a number of these flows would appear to have been unnecessarily rejected, since none of 
the admitted flows ever experienced a delay violation for any of the values of $\sigma$ shown. We 
observed a similar behavior for other window lengths, as well as with bucket sizes larger than 20. 
We also observed similar behavior with heterogeneous sources, where only the bucket size is fixed 
and the token rate is computed on-line for each newly generated call with random peak rate, using 
eq. (3). We omit those results for reasons of space. Given these observations regarding equivalent 
token bucket characterizations, for the remainder of this paper we will use $\sigma$ = 1, which 
results in a value of $\rho$ equal to a source’s peak rate. 

3.2 Which performance measures should be observed? 

The admission scheme described above makes use of both delay and rate measures. An adaptive 
window algorithm would be noticeably simpler if it had to rely on one parameter only. 

Window Length [s]   Avg. Delay [ms]   Avg. Rate [Mb/s]   Delay Variance [ms$^2$]   Rate Variance [(Mb/s)$^2$] 
1                   9.51              9.24               543.8                     1.55 
3                   0.48              8.49               7.994                     1.33 
5                   0.056             8.01               0.705                     1.22 
10                  0.0371            6.99               0.493                     1.12 

Table 3 Average and variance of delay and rate measure 

Tables 1 and 2 show that, especially for small values of $\sigma$, the large majority of rejected 
calls are rejected on the basis that they violate the rate criterion (2). An argument for using 
rate measurements rather than delay measurements is provided by a simulation experiment in which 
we evaluated the mean and the variance of both delay and rate measures (not estimates) using 
the window length as a parameter. Our purpose in doing so is to determine which is a more 
“stable” estimate. The simulation scenario is essentially the same as in the previous section. 
However, rather than running fixed-length simulations, we used the batch means approach in 
Fishman (1991) as a stopping criterion. The simulations were run until the width of the 90% 
confidence interval was within 10% of the point estimate. 
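
The stopping rule just described can be sketched as follows. This is a simplified illustration rather than the procedure of Fishman (1991): it assumes the batch means have already been collected, uses a normal quantile instead of a Student t quantile, and interprets "width" as the full width of the interval. The function name `precise_enough` is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def precise_enough(batch_means, confidence=0.90, rel_width=0.10):
    """True once the confidence interval is narrow relative to the mean."""
    n = len(batch_means)
    if n < 2:
        return False                      # no variance estimate yet
    centre = mean(batch_means)
    if centre == 0:
        return False                      # relative width undefined
    # Two-sided normal quantile (assumption: enough batches for this to be
    # a reasonable stand-in for the Student t quantile).
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(batch_means) / n ** 0.5
    return 2.0 * half_width <= rel_width * abs(centre)


print(precise_enough([9.2, 9.3, 9.25, 9.22, 9.28]))   # tight batches
print(precise_enough([1.0, 20.0, 3.0]))               # widely scattered batches
```

A simulation loop would append one batch mean at a time and stop as soon as the function returns True.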

Table 3 shows the mean and variance estimates for different measurement window sizes, for 
the case that the admission control algorithm had a utilization target of 100 %. The average 
delay is in ms, its variance in ms$^2$, while the average rate is in Mb/s and its variance in 
(Mb/s)$^2$. Somewhat expectedly, these results discourage the use of delay estimates. Not only is 
the variance extremely high for short windows, but it is also very sensitive to the window length. 
Also, note the impact of varying measurement window lengths on delay and rate average. While the 
rate merely shows a 25 % decrease over the range of window sizes considered, the average delay 
values differ by 2-3 orders of magnitude. 
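
The two comparisons in this paragraph can be reproduced directly from the averages reported in Table 3. The snippet below is only a worked check, and the variable names are ours.

```python
import math

# Averages from Table 3, for window lengths of 1, 3, 5 and 10 s.
avg_delay_ms = [9.51, 0.48, 0.056, 0.0371]
avg_rate_mbps = [9.24, 8.49, 8.01, 6.99]

# Relative decrease of the average rate between the shortest and longest window.
rate_drop = (avg_rate_mbps[0] - avg_rate_mbps[-1]) / avg_rate_mbps[0]

# Ratio of the average delays for the same two windows, in orders of magnitude.
delay_orders = math.log10(avg_delay_ms[0] / avg_delay_ms[-1])

print(f"rate decrease: {100 * rate_drop:.1f} %")
print(f"delay spread:  {delay_orders:.1f} orders of magnitude")
```

The rate falls by a little under a quarter, while the delay averages are more than two orders of magnitude apart.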

The adaptive measurement-based admission control algorithm we present in the following section 
thus only uses rate measurements. We note that this also fits well with the IETF specification 
of the Controlled Load Service in Wroclawski (1995), in that it does not make use of target values 
for such direct performance measures (e.g., delay or loss) but rather ensures that adequate 
bandwidth is available to handle the declared traffic rate. 



4 A NEW MEASUREMENT-BASED ADMISSION ALGORITHM 

In the previous sections we have discussed several aspects of measurement-based call admission 
control. We enumerate these, and other considerations: 



Figure 2 First-level algorithm 

Figure 3 Second-level algorithm 

Proposition 1 A measurement-based admission algorithm should: 



• avoid using fixed-length measurement windows, on the grounds that the traffic characteristics 
are usually unknown, or can vary; 

• adapt to changing traffic conditions, enlarging or shrinking its measurement window so as to 
realize a more or less conservative admission process; 

• require sources to characterize their traffic using only the peak rate; 

• rely upon rate, rather than delay measurements. We note, however, that delay measurements 
could be used in conjunction with rate measurements. 



The adaptive measurement-based call admission algorithm described below is divided into 
two “levels”. The first-level algorithm, which in essence forms the core of the algorithm, uses 
rate measurements and the declared traffic parameters of accepted calls to adaptively adjust the 
length of the measurement window. The second-level algorithm adjusts the values of parameters 
used by the first level algorithm. 

The idea behind the first-level algorithm is quite simple. The algorithm continually shrinks the 
length of the measurement window (resulting in admission control decisions that work towards 
increasing link utilization; see Tables 1 and 2, and the discussion in Jamin et al. (1995)) until 
the amount of traffic generated by accepted calls reaches a trigger value. This trigger value is 
actually a rate value, smaller than the output link capacity, providing an early warning that the 
system is about to reach a load level where some form of control may need to be implemented 
(this does not necessarily mean that we are approaching overload, see Wroclawski (1995) for 
further discussion). The first-level algorithm then reacts by enlarging the measurement window 
until the measured rate drops below the trigger, at which point the window can be shrunk again. 
In a sense, the admission control algorithm operates like TCP congestion control - it tries to 
be more and more aggressive in admitting calls until something “bad” happens. At that point 
it backs off (i.e., it enlarges the window, resulting in a stricter call admission process) but 
after backing off it again begins to decrease the measurement window. 

Fig. 2 illustrates the first-level algorithm. The variables used are: 

• $W$, length of the measurement window. In our simulations, we do not allow the measurement 
window to become smaller than 1 second. 

• $f_{up}$, window enlargement factor 

• $f_{down}$, window shrinking factor 

We will refer to $f_{down}$ and $f_{up}$ as window shaping factors. 
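
Since the flowchart of Fig. 2 is not reproduced here, the following is a minimal sketch of the first-level step as we read it from the description above: enlarge the window while the measured rate is above the trigger, otherwise shrink it, never going below 1 second. The function name and the default factors of 1.2 and 0.9 are assumptions. The defaults are our reading of the parameter values used later in the simulations.

```python
MIN_WINDOW_S = 1.0   # the window is never allowed to drop below 1 second


def adjust_window(window, measured_rate, trigger, f_up=1.2, f_down=0.9):
    """One step of the first-level algorithm (illustrative sketch)."""
    if measured_rate > trigger:
        # Load is above the early-warning level: lengthen the window,
        # which makes the admission decisions more conservative.
        return window * f_up
    # Otherwise keep shrinking the window to push utilization up.
    return max(MIN_WINDOW_S, window * f_down)


if __name__ == "__main__":
    window = 10.0
    for rate in (8.5e6, 8.8e6, 9.2e6, 9.4e6, 8.9e6):
        window = adjust_window(window, rate, trigger=9.0e6)
        print(f"measured {rate / 1e6:.1f} Mb/s -> window {window:.2f} s")
```

The sketch is invoked each time a measurement window ends. How the measured rate is obtained is left to the estimation process of section 2.4.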

Although the algorithm does away with the “performance tuning knobs” and fixed length 
window in the original algorithm (see Jamin et al. (1995) and section 2), it appears that we have 
traded these for equally uncertain parameters: the trigger value and the shaping factors. The 
second-level algorithm addresses this issue. 

Fig. 3 illustrates the second-level algorithm. It lowers and raises the trigger in search of a 
good operating point, in response to traffic fluctuations. The algorithm reacts to delay bound 
violations while cushioning the negative effects of the delay variance. Two delay-related measures 
are used. The first delay measure is the instantaneous delay violation percentage; it provides a 
warning that a percentage of bits in the switch’s queue are violating the delay bounds (thus it is 
easily implementable). The second delay measure is the “long term delay violation percentage”; 
it is the percentage of late bits observed since time 0. We stress that this is only one of the 
possible solutions: an alternative scheme might choose to address the instantaneous delay alone 
rather than also including the long term percentage. 

The algorithm operates as follows: at time 0 we set a target value for the long term delay 
violation percentage, which we strive to guarantee over n hours (the target commitment period). 
Each time the measurement window is reset and the first-level algorithm attempts to change the 
window length, the second-level algorithm checks whether the instantaneous delay violation 
percentage is higher than the target delay percentage. If so, the admission algorithm has been too 
aggressive, therefore the trigger rate value is lowered, resulting in more conservative admission 
control. If the instantaneous delay percentage does not exceed the target, the algorithm checks 
whether the long term delay violation percentage is currently under the target value. If it is, 
the window trigger value is raised. If it is not, no action is performed. This last choice is 
motivated by the fact that if the instantaneous target delay is not violated, the trigger is 
sufficiently low and the overall delay only needs more time to settle under the target. In order 
to lessen the scheme’s sensitivity to delay variations, we suggest additive rather than 
multiplicative trigger increments. Of course, the trigger cannot be higher than the rate bound, 
i.e., the output link capacity. We will indicate trigger increments and decrements by $t_{up}$ and 
$t_{down}$, respectively. 
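
The decision rule of the second-level algorithm can be sketched as below. As Fig. 3 is not reproduced here, this follows the prose description only. The function name, the default step sizes (our reading of the values used in the simulations) and the floor of zero on the trigger are assumptions.

```python
def adjust_trigger(trigger, inst_violation_pct, long_term_violation_pct,
                   target_pct, link_capacity, t_up=0.01e6, t_down=0.05e6):
    """One step of the second-level algorithm (illustrative sketch).

    Percentages are in percent; trigger, link_capacity and the additive
    steps t_up and t_down are in bit/s.
    """
    if inst_violation_pct > target_pct:
        # Admission has been too aggressive: lower the trigger.
        # Assumption: the trigger is not allowed to become negative.
        return max(0.0, trigger - t_down)
    if long_term_violation_pct < target_pct:
        # There is room under the long term target: raise the trigger,
        # but never above the output link capacity.
        return min(link_capacity, trigger + t_up)
    # Instantaneous measure is fine, long term one still settling: wait.
    return trigger


print(adjust_trigger(9.0e6, 0.5, 0.05, 0.1, 10e6))   # lowered
print(adjust_trigger(9.0e6, 0.0, 0.05, 0.1, 10e6))   # raised
print(adjust_trigger(9.0e6, 0.0, 0.20, 0.1, 10e6))   # unchanged
```

The steps are additive, as suggested in the text, and the decrement is larger than the increment so that the scheme backs off faster than it probes.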



5 SIMULATION RESULTS 

We now evaluate the proposed algorithm under different traffic scenarios. The first scenario uses 
the source model described in section 3.1. The algorithm’s parameters are: 

• $f_{up}$ = 1.2 

• $f_{down}$ = 0.9 

• $t_{up}$ = 0.01 Mb/s 

• $t_{down}$ = 0.05 Mb/s 



In our first set of results, the target delay percentage was guaranteed over a period of 12 
hours, and a delay bound of 16 ms was chosen. Ideally, a 12-hour period would allow network 
administrators to alternate between more and less stringent commitments depending on the 
portion of the day when the network is likely to be more congested. The trigger was initially set 
at 9.0 Mb/s. 

Tables 4 and 5 show the performance of the two-level algorithm using different target values 
for the long term delay violation percentage, ranging from 0.01% to 2%, for both homogeneous 
and heterogeneous sources. As expected, as the target (allowable) percentage of delay violations 
increases, so too does the achieved utilization. Note that both types of sources adhere to the 
guaranteed target, i.e., the percentage of late bits observed (column 4 in the tables) closely 
matches the long term delay violation target values shown in column 1. We also observe that 
homogeneous sources achieve a higher utilization than heterogeneous sources. This is because 
their fixed rate yields more ‘predictable’ behavior, and increased predictability accounts for 
more reliable estimates at the time of admission. This, in turn, yields fewer violations and the 
opportunity of better exploiting the available bandwidth, thus increasing utilization. 

Figures 4 and 5 provide a useful insight into the algorithm. Fig. 4 plots the percentage of late 
bits since time 0 for three different values of the target long term delay violation percentage 
(1%, 0.1% and 0.01%) as a function of time for heterogeneous sources. For each curve, we can 
identify two critical regions in Fig. 4: a transient showing a peak and a curve converging to the 
target value, and a steady state region capturing the algorithm’s effort to preserve the target 
percentage. The transient peak is generated by the lack of information available to the network 
at the beginning of each target commitment period, resulting in the algorithm being unable to 
find a stable operating point. Whereas both the 1% and the 0.1% case reach the steady state 
long before the commitment period has expired, the 0.01% target curve cannot reach the steady 
state regime within the time period studied due to the high initial transient peak. Actions that 
might reduce it include setting a lower starting value of the trigger or initializing the window to 
several hundred seconds. 

The trigger value dynamics for the three curves from Fig. 4 are shown in Fig. 5. It is interesting 
to observe how trigger peaks are matched to regions where the percentage of late bits is below 
the target (this is particularly evident for the 0.1% case at approximately times 13000, 23000, 
32000 and 41000 seconds). Indeed, the algorithm raises the trigger each time the target value is 
reached. 

Target [%]   Avg. window size   % of adm. flows   % of late bits   Utilization [%] 
0.01         4.71               25.3              0.0122           85.85 
0.05         3.94               25.7              0.0530           87.19 
0.10         3.95               26.1              0.1201           86.94 
0.20         3.48               25.9              0.2058           88.06 
1.00         2.19               27.4              1.0139           91.00 
2.00         1.72               27.7              2.0060           92.36 

Table 4 Two-level algorithm - homogeneous sources 

Target [%]   Avg. window size   % of adm. flows   % of late bits   Utilization [%] 
0.01         7.15               21.3              0.0232           74.86 
0.05         6.00               22.5              0.0498           77.32 
0.10         5.65               22.7              0.1048           78.05 
0.20         4.65               23.4              0.2080           80.82 
1.00         3.30               24.4              0.9988           84.51 
2.00         2.72               25.8              1.9809           86.59 

Table 5 Two-level algorithm - heterogeneous sources 

Figure 4 Percentage of overall late bits, as a function of time, for different delay targets 
- heterogeneous sources 

Figure 5 Trigger values as a function of time, for different delay targets - heterogeneous sources 

Target [%]   Avg. window size   % of adm. flows   % of late bits   Rel. Util. [%] 
0.01         3.25               22.1              0.0708           45.67 
0.05         2.97               24.0              0.1076           57.08 
0.10         2.50               26.5              0.2923           48.29 
0.20         2.00               30.2              0.2344           59.71 
1.00         1.51               34.6              1.0142           67.77 
2.00         1.40               35.3              2.0171           68.63 

Table 6 Two-level algorithm - sensitivity to traffic fluctuations 

Thus far, our simulation parameters resulted in a network operating in overload. In the next 
set of results, we investigate the algorithm’s response to traffic fluctuations. Specifically, the 
call arrival process was modified so that every 100 seconds on average (values were actually 
determined as independent identically distributed random variables with mean 100), the call 
arrival rate was switched between 10 sources per second with probability 2/3, and 0.01 sources 
per second with probability 1/3. Heterogeneous sources were used. Although the ensuing traffic 
pattern is not meant to model actual network traffic, it is nevertheless interesting to test how the 
algorithm handles this rather extremal situation where prolonged underload periods are followed 
by sharp traffic peaks. The trigger was initialized at 1 Mb/s to reduce the transient peak. 
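
The modulated call arrival process can be sketched as a schedule of (switching time, arrival rate) pairs. The text does not state the distribution of the switching intervals, so drawing them from an exponential distribution with mean 100 s is an assumption made here. The function name and the fixed seed are also ours.

```python
import random


def arrival_rate_schedule(horizon_s, seed=0, mean_period_s=100.0,
                          high_rate=10.0, low_rate=0.01, p_high=2.0 / 3.0):
    """Return a list of (start_time_s, calls_per_second) pairs."""
    rng = random.Random(seed)     # fixed seed for reproducible runs
    schedule = []
    now = 0.0
    while now < horizon_s:
        # At each switching instant pick the high rate with probability 2/3.
        rate = high_rate if rng.random() < p_high else low_rate
        schedule.append((now, rate))
        # Assumption: i.i.d. exponential periods with the stated mean.
        now += rng.expovariate(1.0 / mean_period_s)
    return schedule


for start, rate in arrival_rate_schedule(1000.0, seed=1):
    print(f"t = {start:7.1f} s: {rate} calls/s")
```

Within each period, call arrivals would then be generated as a Poisson process with the rate in force.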

Using the same algorithm parameters listed above, we obtained the results in Table 6. Fig. 6 
plots the number of admitted and active sources during the simulation (to improve readability, 
only the first 10000 seconds are shown). Note that the algorithm fails to meet the target value of 
the long term delay violation percentage when the target is smaller than 0.1%. For this case of a 
non-stationary call arrival process, we show a relative measure of utilization: the ratio of the 
utilization achieved using admission control to the utilization which could be achieved if no 
admission control were enforced (approx. 77 %). Of course, the closer the relative utilization is 
to 100 %, the lesser the impact of admission control on link utilization. 
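
As a worked example, the relative figures of Table 6 can be converted back into approximate absolute link utilizations. The sketch takes the relative utilization to be the utilization with admission control divided by the utilization without it, and uses the approximate 77 % reference value, so the results are only indicative. The names are ours.

```python
NO_CONTROL_UTILIZATION_PCT = 77.0   # approximate value quoted in the text


def absolute_utilization(relative_pct):
    """Convert a relative utilization (percent) into an absolute one."""
    return relative_pct * NO_CONTROL_UTILIZATION_PCT / 100.0


# Loosest and tightest targets of Table 6 (2 % and 0.01 %).
for target, rel in ((2.00, 68.63), (0.01, 45.67)):
    print(f"target {target} %: about {absolute_utilization(rel):.1f} % of the link")
```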

So far, we have taken for granted that our values for the window shaping factors and the trigger 
increments were good ones. Fig. 7 gives an indication of how sensitive our algorithm is to the 
choice of $t_{up}$ and $t_{down}$. The figure shows the percentage of late bits with a target value 
of 0.05 % for the long term delay violation percentage, heterogeneous sources and constant overload 
(i.e. no traffic fluctuations). The increments are expressed in kb/s and range from 1 to 1000. It 
can be seen that we should take extra care in choosing the values of $t_{down}$. Also, very low 
decrements appear to have little effect on keeping the percentage of late bits close to the target, 
unless paired with equally low increments. We have observed a similar behavior for other scenarios 
as well, confirming a simple “rule of thumb” - that decrements should be higher than increments. 
While the results so far have given us an indication as to what range these values should be 
chosen from, a robust procedure for picking these values remains a topic for future research. 



6 CONCLUSIONS AND FUTURE WORK 



Figure 6 Two-level algorithm - Admitted and active sources as a function of simulated time 
- heterogeneous sources 

Figure 7 Percentage of overall late bits as a function of trigger increments and decrements 
- heterogeneous sources 

The issue of measurement-based admission control was the topic of this paper. Through the use 
of simulation, we have addressed the problem of what source characterization and what measured 
quantities best suit a measurement-based admission control scheme for predictive service. This 
enabled us to devise a simple admission algorithm, relying on rate measures only and using only 
the source peak rate as TSpec. A delay bounding scheme was used together with the admission 
algorithm to provide delay guarantees over a period of time. The analysis of our results has 
shown the algorithm and the bounding scheme to perform well under overload conditions, while 
extremal traffic fluctuations have proved to be more difficult to handle, particularly under tight 
guarantees. 

Our ongoing work is investigating the effect of traffic fluctuations and more thoroughly 
exploring the algorithm’s sensitivity to its tuning parameters. We are also studying the algorithm 
under empirical traffic patterns, such as video traces. 

Abstract 

In this paper we study the TCP and XTP transport protocols in light of their possible use 
for applications running over B-ISDN in a UBR-like context. In particular, we investigate the 
performance of file transfer applications taking into account both the impact of the ATM layer 
protocols and the features of the adopted transport protocol. The investigation is performed by 
simulation, using CLASS, a simulation tool for the study of ATM networks at the cell and burst 
levels, and considering quite a simple topology, known as the bottleneck topology, where TCP 
or XTP connections share a bottleneck link with some interfering background traffic, that is 
taken to have either Poisson or ON-OFF characteristics. Although the two transport protocols 
have quite different features, their performances are rather similar in almost all the considered 
scenarios. However, when XTP and TCP connections are mixed within the same network, the 
TCP performance is dramatically worse than that achieved by XTP, due to a more aggressive 
policy of XTP in the access to the transmission medium: a behavior that should raise some 
concern about the performance achievable when mixing different transport protocols on the 
same network. 

1 INTRODUCTION 

Much research is in progress to investigate traffic and performance aspects of ATM networks, so 
as to correctly engineer B-ISDN services. While recently a lot of attention has been devoted to the 
performance of TCP [Transmission Control Protocol - see Van Jacobson (1990)] when used over 
ATM networks [see Romanow and Floyd (1994), Bianco (1994), Ajmone Marsan et al. (1996), 
Lakshman et al. (1996), Li et al. (1996)], to our knowledge no study has yet been published about 
the performance of XTP [Xpress Transport Protocol - see XTP Forum (1995)] over ATM networks. 
The comparison of TCP and XTP is an interesting subject of research when we consider data 
transfer applications, where the performance seen by the end-user not only depends on the ATM 
and the AAL (ATM Adaptation Layer) layers, but is also greatly influenced by the behaviour 
of the transport protocols. 

*This work was supported in part by a research contract between CSELT and Politecnico di Torino, in part by 
the EC through the Copernicus Project 1463 ATMIN, and in part by the Italian Ministry for University and 
Research. 

The interest in XTP is justified by the fact that XTP is a rate-based transport protocol designed
for high-speed networks, in contrast with the window-based approach adopted by TCP.
This important difference could have a significant impact on the performance figures obtained
by the two protocols. A further major difference between TCP and XTP concerns the congestion
prevention approach. Whereas congestion avoidance algorithms are embedded within the
standard versions of TCP, no such feature is included in XTP.

The objective of our study is to determine whether the two transport protocols offer signifi- 
cantly different performance when used in the context of ATM networks. In our study we did 
not consider the traffic control mechanisms proposed to efficiently manage best-effort services 
in ATM networks, namely the closed-loop control mechanism called ABR (Available Bit Rate) 
proposed in ATM Forum (1995) and the “fast reservation protocol” named ABT (ATM Block 
Transfer) proposed in ITU-TSS (1995). The main reason is that we are interested in looking at 
the basic behaviors induced on transport layer protocols by the segmentation process required 
at the ATM layer; introducing more sophisticated control techniques significantly improves the 
performance figures, but it could mask some of the phenomena we are interested in. Our scenario
resembles the UBR service capability, although for simplicity we do not consider here
techniques like selective cell discarding, which are widely accepted in the UBR context.

In this paper we adopt a simulation approach to study the performance that can be achieved 
by using either TCP or XTP for file transfer applications over ATM networks. We focus on 
the so-called bottleneck topology, where file transfer connections share a bottleneck link with a 
variable amount of background traffic. We first present results in a simpler scenario with either 
only TCP or only XTP connections, sharing the bottleneck link with Poisson background traffic, 
to highlight the main characteristics of the two protocols. Then we concentrate on a more bursty 
ON-OFF background traffic; later we modify the network span by considering variable length 
connections to assess the ability of the two protocols to deal with connections with different 
round-trip times. Finally, we consider a mixed scenario, in which some file transfer connections 
are carried by the TCP protocol, while others rely on XTP, to discuss the interaction between 
the two protocols when sharing network resources. 



2 THE SIMULATION MODEL 

The results presented in this paper are obtained with the CLASS ATM network simulator 
[see Ajmone Marsan et al. (1995), Lo Cigno and Munafo (1996)] that was developed at the 
Dipartimento di Elettronica of Politecnico di Torino in co-operation with CSELT (Centro Studi e 
Laboratori Telecomunicazioni, the research center of Telecom Italy) and the Technical University 
of Budapest.

In order to keep our simulation models as close as possible to the original behaviour of the
protocols, instead of developing abstract models for TCP and XTP, we adapted simplified
versions of two distributed releases of the TCP and XTP protocols to run on top of the AAL5
layer within CLASS. We chose the officially distributed BSD 4.3-reno release for TCP [see Van
Jacobson (1990)] and the SandiaXTP 1.3 C++ code [see Distributed Systems Research (1994)]
released by Sandia National Laboratories.

We do not outline here the main characteristics of the TCP-reno congestion control algorithm, 
since the protocol is well-known [see Stevens (1994)], but we discuss the modifications required 
to adapt its code to the ATM environment. 

Our implementation includes all the important features of TCP reno, with the exception of 
the delayed ACK and the selective ACK options. Although the selective ACK option could have 
a remarkable influence on the simulation results, we do not include this option in our model, 
since most of the running versions of TCP do not include it. The delayed ACK option has a 
marginal effect in the scenarios we consider, and it is disabled. 

In order to allow a single connection to grab all the available bandwidth of a high-speed link,
especially in a WAN environment, it is necessary to use a window of reasonably large size; hence
the maximum window dimension is left as a parameter in our implementation.

In present TCP implementations, the granularity of timers (the minimum value a timeout can 
assume) is rather coarse (roughly 500 ms). This granularity impacts the timeout setting and 
the RTT (Round Trip Time) estimation, and it must be reduced in order to allow the protocol 
to react quickly on high bandwidth-delay product networks. As a rule of thumb, the granularity
should be smaller than the propagation delay along the connection, and it should be equal 
for all connections, if fairness among connections sharing the same network resources has to 
be achieved. We left the granularity as a parameter in our model and we followed the above 
guidelines to set its value in all our simulation experiments. 

The Xpress Transport Protocol (XTP) was specifically designed for networks with a high
bandwidth-delay product and a low error rate. Since XTP is not as widely used as TCP,
we briefly outline here the main features of XTP version 4.0, with particular emphasis on the
aspects that are most relevant from the modeling point of view.

XTP supports both flow and rate control. Flow control is regulated by the receiver, and it 
is based on the amount of space available in its buffer. Rate control, one of the main features
of XTP, is enforced by the transmitter, through the use of the timer RTIMER, which is used to
schedule the transmission of frames (the term used to refer to packets in the XTP specifications).
An XTP source can initially transmit up to burst bytes; then it has to wait until RTIMER expires
before being again allowed to transmit up to burst bytes, and so on; burst, the number of bytes
that can be transmitted between two timer expirations, is a simulation parameter.
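
The burst/RTIMER discipline described above can be illustrated with a minimal sketch. It assumes that the RTIMER period is chosen as burst divided by the target rate, so that one burst per period yields the desired average rate; the function names and the numerical values are hypothetical and are not taken from the SandiaXTP code.

```python
def rtimer_interval(burst_bytes, rate_bps):
    """Seconds between RTIMER expirations (assumption: period = burst / rate)."""
    return burst_bytes * 8 / rate_bps


def send_times(total_bytes, burst_bytes, rate_bps):
    """Return (time, bytes) pairs: one burst at t = 0, then one per RTIMER expiration."""
    interval = rtimer_interval(burst_bytes, rate_bps)
    schedule = []
    sent = 0
    tick = 0
    while sent < total_bytes:
        # Never send more than `burst_bytes` between two timer expirations.
        chunk = min(burst_bytes, total_bytes - sent)
        schedule.append((tick * interval, chunk))
        sent += chunk
        tick += 1
    return schedule
```

For instance, with a burst of one 9140-byte frame and a 50 Mbit/s target rate, the sketch spaces consecutive frames by roughly 1.46 ms.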

As we aim at comparing the rate-based control mechanism of XTP with the window-based
flow control mechanism of TCP when both run over ATM networks, we would have liked to 
disable any traffic control mechanism other than rate control. Actually, disabling the XTP flow 
control, as allowed by the standard, and as we did in our simulation experiments, does not 
guarantee that rate control remains the only mechanism that regulates transmission speed. In 
fact, the sender keeps unacknowledged data in its output buffer to be able to retransmit them
if necessary. When the network is long and congested, a significant percentage of XTP frames
can be lost, and the output buffer becomes full because ACKs are not received; as soon as 
the transmitter output buffer becomes full, the transmitter must stop until at least one of the 
outstanding frames is acknowledged, so that the data it carried can be removed from the output 
buffer, and some space can be freed. Thus, the transmitter output buffer acts as a fixed-size 
window at the transmitter, introducing a kind of flow control mechanism. 
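
The window-like effect of the output buffer can be quantified with a rough bound: a sender that can keep at most one buffer of unacknowledged data in flight cannot exceed one buffer per round trip, whatever its rate setting. The sketch below is illustrative only; the function name is hypothetical and the values used in the usage note (a 54816-byte buffer and a 10 ms round trip) are assumptions, not figures reported for this mechanism.

```python
def max_throughput_bps(buffer_bytes, rtt_s, rate_bps):
    """Upper bound on throughput with a fixed output buffer and a rate limit."""
    # At most one full buffer can be outstanding per round trip.
    window_limit_bps = buffer_bytes * 8 / rtt_s
    return min(rate_bps, window_limit_bps)
```

With the assumed values, `max_throughput_bps(54816, 0.010, 50e6)` gives about 43.9 Mbit/s, i.e. the buffer rather than the rate controller becomes the limiting factor.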

XTP retransmits unacknowledged frames based on either a Go-Back-N or a Selective Retrans- 
mission scheme. We will refer to these two versions of the protocol as XTP GBN and XTP SR 
in the remainder of the paper. 

The XTP specification does not state when ACKs and NAKs are to be sent by the receiver. 
ACKs (NAKs) are requested by the transmitter, and the receiver sends back ACKs (NAKs) as 
soon as it gets a request. Moreover, the transmitter can request the receiver to send a NAK 
if some data are missing (fast NAK option). After having requested an ACK, the transmitter 
starts a timer called WTIMER; the transmitter expects to receive the ACK (NAK) before the timer 
expires. WTIMER is initialized to a value based on an estimation of the round trip time similar to 
the one implemented in TCP. 

If the ACK does not reach the transmitter before the expiration of WTIMER, the transmission 
of data is suspended and a synchronizing handshake is started: the transmitter sends a request 
for an ACK that contains the status of the transmitter and demands an ACK containing the 
actual status of the receiver. If the reply is not received before the timer expires, a new request 
packet is issued, and WTIMER is doubled. The process is repeated until an ACK is received by the 
transmitter before the expiration of WTIMER. This terminates the synchronizing handshake; now 
each one of the two connection end-points precisely knows the status (which data bytes have 
been sent, which have been received, and which are missing) of the other, and the data flow can 
resume. 
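
The timing of the synchronizing handshake can be sketched as follows. This is a simplified model under stated assumptions: a reply that arrives after the current WTIMER value is treated as lost, and the function name and its arguments are hypothetical.

```python
def handshake_duration(initial_wtimer, reply_delays):
    """Return (elapsed_time, requests_sent) for one synchronizing handshake.

    reply_delays[i] is the delay of the reply to the i-th request,
    or None if that request or its reply is lost.
    """
    elapsed = 0.0
    wtimer = initial_wtimer
    for attempt, delay in enumerate(reply_delays, start=1):
        if delay is not None and delay <= wtimer:
            # The ACK arrives before WTIMER expires: the handshake ends.
            return elapsed + delay, attempt
        # WTIMER expires: a new request is issued and WTIMER is doubled.
        elapsed += wtimer
        wtimer *= 2
    raise RuntimeError("no reply arrived within the supplied attempts")
```

For example, `handshake_duration(0.25, [None, None, 0.125])` models two lost exchanges followed by a successful one: the sender is blocked for 0.25 + 0.5 + 0.125 time units and issues three requests.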

Even though the XTP 4.0 specification does consider the possibility of having one timer per
ACK request, it does not require implementing them: a single timer which is overwritten at
each new request can be used, and the Sandia 1.3 C++ implementation adopts this option. If an
ACK is requested before the previous one has been received (and before WTIMER has expired), the 
transmitter re-initializes WTIMER; thus, previous ACK requests cannot trigger a synchronizing 
handshake any more. 
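
The consequence of using a single, overwritten timer can be made concrete with a small sketch. It assumes that no ACK ever arrives and that WTIMER keeps a constant value (both simplifications); the function name is hypothetical.

```python
def first_wtimer_expiry(request_times, wtimer):
    """Time at which a single WTIMER, re-armed at every ACK request, first expires.

    request_times must be a non-empty, increasing list of ACK request times.
    """
    for current, following in zip(request_times, request_times[1:]):
        if following - current >= wtimer:
            # No new request re-armed the timer in time: it expires here.
            return current + wtimer
    # Every request re-armed the timer; only the last one can expire.
    return request_times[-1] + wtimer
```

With requests at times 0, 1, 2, 3 and a WTIMER of 4, the timer expires only at time 7, i.e. one full WTIMER after the last request, even though the first request has been outstanding since time 0.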



3 THE SIMULATION SCENARIO 

The dynamics of protocols like TCP and XTP over ATM networks are clearly influenced by 
many different parameters, such as the network topology, its span, the buffer size available in 
the network nodes, and so forth. 

We investigate through simulation the performance of both TCP and XTP over ATM networks 
in different scenarios, trying to isolate behaviours due to the interaction of the protocols under 
study with the main ATM characteristics, from secondary effects due to other parameters. We 
base our study on connections in sustained overload resulting, for instance, from a very long file
transfer or from an application requiring TCP or XTP to work in streaming mode. We assume
that sources always have data to send; data are transmitted as fast as allowed by the control
mechanism of the transport protocol. Each TCP and XTP connection† is mapped onto an ATM
virtual circuit.

Figure 1 The bottleneck topology.

The topology we concentrate on in this paper is reported in Fig. 1 and we refer to this topology 
as the “bottleneck” topology in the remainder of the paper. Three transmitters are connected 
to one ATM switch, and the three corresponding receivers are connected to the other one. The 
transmitter/receiver pairs can adopt either TCP or XTP as their transport protocol. A variable 
amount of background traffic shares the link between the two ATM switches with the three 
connections under study. 

The data rate on all channels, both user-node and node-node, is set to 150 Mbit/s, and the
size of the buffers inside ATM switches is set to 1000 cells. The data flow of the TCP (XTP) 
connections is assumed to be unidirectional: transmitters send data segments, and TCP (XTP) 
receivers return only ACKs; TCP ACKs fit in one cell, while XTP ACKs are two cells long 
(ACKs) or more (NAKs). Losses are avoided both in the user transmission buffer and at the 
receiver: they can occur only within ATM switches. 

We take as simulation parameters the maximum TCP window size W and the dimension of 
the “useful data PDU”, i.e., the number of user data bytes that are transmitted within each 
TCP segment or XTP frame. We take as reference value 9140 bytes, which, when the TCP, IP
and AAL5 overheads are added, makes an AAL5 CS PDU of 9188 bytes, which complies with
the suggested maximum segment size for IP over ATM. With the 32 bytes of the XTP header
and the AAL5 overheads, XTP sources generate AAL5 CS PDUs of 9180 bytes. We will use the
term PDU to identify either TCP segments or XTP frames and the acronym MPS to identify
the Maximum PDU Size in our simulation runs. The TCP code was slightly modified so that
only PDUs of size MPS are transmitted. As an additional MPS value we use 1142 bytes, which is
1/8 of the reference MPS measured in ATM cells when headers are included. Note that actual
implementations of TCP use a very small MPS for inter-LAN traffic, typically 512 bytes; this
limitation is mainly due to the available IP routers, and it will probably disappear in future
implementations.

† More precisely we should use the term association for XTP; we will refer to XTP associations as XTP connections
for simplicity, adopting the TCP terminology.
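
The PDU sizes quoted for the reference MPS can be checked with a short sketch. The split of the overheads is an assumption chosen so that the totals match the text: 20-byte TCP and IP headers without options, 8 bytes of AAL5 overhead, and 48 bytes of payload per ATM cell; only the 32-byte XTP header is stated explicitly above. The names are hypothetical.

```python
import math

ATM_CELL_PAYLOAD = 48  # payload bytes per ATM cell (assumption)
AAL5_OVERHEAD = 8      # bytes, assumed so that the totals match the text
TCP_IP_HEADERS = 40    # assumed 20-byte TCP + 20-byte IP headers, no options
XTP_HEADER = 32        # bytes, as stated in the text


def cs_pdu_bytes(user_bytes, transport_overhead):
    """Size of the AAL5 CS PDU carrying `user_bytes` of useful data."""
    return user_bytes + transport_overhead + AAL5_OVERHEAD


def cells_needed(pdu_bytes):
    """Number of ATM cells needed to carry one CS PDU (last cell padded)."""
    return math.ceil(pdu_bytes / ATM_CELL_PAYLOAD)
```

Under these assumptions, `cs_pdu_bytes(9140, TCP_IP_HEADERS)` returns 9188 and `cs_pdu_bytes(9140, XTP_HEADER)` returns 9180, and both PDUs occupy 192 cells.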

XTP sources are rate controlled and generate traffic at an average rate of 50 Mbit/s. The 
burst length burst of the rate controller is set to one frame because high burstiness is usually 
critical in ATM networks. Thus, our XTP sources transmit one PDU and then wait for RTIMER 
to expire. 

The XTP flow control is disabled, but the size of the transmission buffer is set to the same value
as the TCP window. Our XTP transmitter implementation requests ACKs at regular intervals,
and in the simulations presented in this paper an ACK is requested for each transmitted PDU, to
be consistent with the TCP behaviour. Concerning error control in XTP, the fast NAK option is 
always used, and both the Go-Back-N and Selective Retransmission schemes are used, in order 
to compare their effectiveness in the various scenarios. 

Simulations were run both with and without a traffic shaper operating at the ATM level at 
50 Mbit/s. The shaping device operates according to a modification of the Generic Cell Rate 
Algorithm described in ITU-TSS (1995). 



4 NUMERICAL RESULTS 

The main performance figures we present in this study are the goodput, i.e., the average effective
amount of user data that are transferred from the transmitter to the receiver, discarding all
overheads, faulty PDUs, and possible retransmissions, and the efficiency, i.e., the ratio between
the goodput and the user offered load. The goodput is a measure of how the end user would 
perceive the performance of the network. The efficiency, on the other hand, is a measure of 
how well the network is exploited. If not otherwise stated, the performance figures are averaged 
over the 3 connections: the actual performances of individual connections lie within 1-2% of the 
average. 
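
As a minimal illustration of the two performance figures, the sketch below computes them from hypothetical measurement totals. The function names and the sample values are assumptions, and both quantities are expressed in bit/s.

```python
def goodput_bps(useful_bytes_delivered, duration_s):
    """User data delivered exactly once and without errors, per unit time."""
    return useful_bytes_delivered * 8 / duration_s


def efficiency(goodput, offered_load):
    """Ratio between the goodput and the user offered load (same units)."""
    return goodput / offered_load
```

For example, a connection that offers 50 Mbit/s and delivers 5,000,000 useful bytes in one second has a goodput of 40 Mbit/s and an efficiency of 0.8.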

We consider two types of background traffic: Poisson and ON-OFF. The Poisson background 
traffic results from the segmentation of user messages, generated according to a Poisson process, 
with a truncated geometric message length distribution whose mean is equal to 20 cells and 
whose maximum is equal to 200 cells. The background traffic is then shaped with an allowed 
peak cell rate equal to 1.2 times the average cell rate, thus introducing only a small burstiness. 
The ON-OFF background traffic is generated by a single ON-OFF source, sending a CBR cell 
flow during ON periods, with geometrically distributed ON and OFF period lengths. We consider 
two possible values for the average duration of the ON and OFF periods, namely 20000 and 10000 
cells in the first case (we will refer to this background as fast ON-OFF) and 200000 and 100000 
cells in the second one (slow ON-OFF). On average the ON-OFF background traffic source 
transmits for 2/3 of the simulation time; the transmission rate during the ON period is set to 
the value that produces the required average background traffic load. This type of traffic, which 
could be considered quite unrealistic, allows us to discuss phenomena and behaviours typical of 
transient phases using our time-averaged performance indices. 
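
The relation between the average load and the ON-period rate of the ON-OFF source can be sketched as follows. The geometric sampler uses standard inverse-transform sampling; the function names are hypothetical and period lengths are expressed in cell times.

```python
import math
import random


def on_fraction(mean_on, mean_off):
    """Long-run fraction of time the source spends in the ON state."""
    return mean_on / (mean_on + mean_off)


def on_rate(average_load, mean_on, mean_off):
    """Rate during ON periods that yields the required average load."""
    return average_load / on_fraction(mean_on, mean_off)


def sample_period(mean, rng):
    """Geometrically distributed period length (>= 1) with the given mean (> 1)."""
    p = 1.0 / mean
    u = rng.random()  # uniform in [0, 1)
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1
```

With mean ON and OFF durations of 20000 and 10000 cells the ON fraction is 2/3, so an average background load of 0.4 requires a rate of about 0.6 during ON periods.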

The following subsections present first the results obtained using 3 TCP or 3 XTP connections
when a Poisson background traffic is considered. Then we discuss the situation when an ON-OFF
background traffic is adopted. Later, variable length connections are studied, and finally a
mixed XTP-TCP scenario is considered.

Figure 2 Aggregate average goodput and efficiency for TCP connections in the 1000 km bottleneck
topology, as a function of the Poisson background traffic load, the maximum window W
and MPS.

4.1 TCP connections with Poisson background traffic 

Although a number of analyses of the performance of TCP connections on ATM networks have
already appeared [see Romanow and Floyd (1994), Bianco (1994), Ajmone Marsan et al. (1996),
Lakshman et al. (1996), Li et al. (1996)], it can be useful to discuss the impact of such basic
parameters as the maximum window and segment size. Fig. 2 reports the aggregate performance
for 3 TCP connections with length 1000 km, as a function of the background traffic load. Each
plot reports the aggregate goodput (solid lines) and efficiency (dashed lines), both in the case 
when no shaping is applied (square markers) and in the case when TCP transmitters shape their 
traffic at 50 Mbit/s (round markers). The plots in the left column refer to the case of “optimal” 
window size, when the maximum TCP window size in bytes Wb = MPS • W is set so as to 
allow each connection to exploit roughly 1/3 of the link bandwidth when no background traffic 
is present, disregarding queueing delays in the ATM switch buffers. Plots in the right column 
consider a 3 times larger value for Wb, either accounting for the possibility for each connection 
to exploit the whole link capacity, or to maintain a high throughput even if the queueing delays 
increase the RTT estimate. The two rows refer to results obtained with different maximum PDU 
size in bytes. The upper row refers to the case of MPS = 9140 which is the recommended maxi- 
mum segment size for IP over ATM networks, while the lower row refers to the case MPS = 1142; 
the maximum window size in segments W is increased so as to keep the dimension Wb roughly 
constant. 
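
The notion of an "optimal" window can be illustrated with a bandwidth-delay computation. This is only a sketch: it assumes a propagation delay of about 5 µs per km (hence a round trip of roughly 10 ms on a 1000 km connection), ignores queueing and transmission delays, and is not claimed to reproduce the exact window values used in the simulations. The function names are hypothetical.

```python
def window_bytes(share_bps, rtt_s):
    """Window (in bytes) needed to sustain `share_bps` over a round trip of `rtt_s`."""
    return share_bps * rtt_s / 8


def window_segments(share_bps, rtt_s, mps_bytes):
    """Largest whole number of MPS-sized segments that fits in that window."""
    return int(window_bytes(share_bps, rtt_s) // mps_bytes)
```

For a 50 Mbit/s share (one third of the 150 Mbit/s link) and the assumed 10 ms round trip, the window is about 62500 bytes, i.e. 6 segments of 9140 bytes or 54 segments of 1142 bytes.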

The curves in the left column show that, when the background traffic is light and the maximum 
TCP window is optimized, the network resources are very well exploited. The TCP goodput is in 
fact very close to the available free capacity (one third of the link capacity minus the background 
load, as indicated by the dot-dashed line), and the efficiency of the transmission is always 1. For 
higher loads, the network becomes lossy and both goodput and efficiency sharply drop. 

Comparing the left and right columns, it is quite evident that the maximum window setting 
has a remarkable impact on the performance that can be obtained by TCP: if Wb is too large, 
the performance of TCP becomes very poor. A larger window implies, in fact, the possibility of 
transmitting at higher speed, but also a more aggressive policy in the window size increase algo- 
rithm: under these conditions, the actual transmission window oscillates at a higher frequency, 
and its average value, that determines the obtained throughput, is smaller. Besides, it must be 
noted that very large windows generate congestion even when the background load is very light, 
a fact that further decreases the performance of the network. A preliminary conclusion is that 
in order to exploit network resources well, the TCP window must be carefully determined: an 
overestimation of the window size can be quite dangerous. 

The effect of the PDU size on goodput is negligible; however, if we consider efficiency, it is 
quite clear that the smaller the PDU the better the efficiency, especially when the network is
overloaded. Clearly, the PDU size cannot be reduced too much, otherwise the overheads would 
become too large. It may be worthwhile to stress the importance of efficiency; it must be taken 
into account that a sharp drop in efficiency indicates most of all that the network is bringing 
to destination cells that are worthless, either because they belong to damaged segments, or 
because they are unnecessarily retransmitted. As a consequence, the efficiency is an indirect but 
significant measure of the effect of transport protocol connections on background traffic. In fact
the link load ρ can be obtained by summing the background load and the TCP or XTP goodput
divided by the efficiency; thus, for a given goodput, a low efficiency of TCP or XTP implies that
the network is more heavily loaded and hence the background traffic suffers in terms of higher
cell loss and delay jitter.
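
The link load relation stated above can be written as a one-line function. The sketch assumes that all quantities are normalized to the link capacity; the function name and the sample values are hypothetical.

```python
def link_load(background_load, goodput, efficiency):
    """Link load: background load plus the traffic actually offered by the
    transport connections, i.e. their goodput divided by their efficiency."""
    return background_load + goodput / efficiency
```

For the same goodput of 0.2, an efficiency of 1 gives a link load of 0.7 with a background load of 0.5, whereas an efficiency of 0.5 raises it to 0.9.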

As a final consideration, we observe that, in this scenario, shaping the TCP traffic does not
significantly alter performance, possibly even having a negative effect. This is due to two different
reasons: i) traffic shaping can improve the performance when short term congestion is the key
issue, while in the considered case the congestion is due to a long term effect; ii) by spreading the
cells of a single segment over time, shaping makes it more likely that one of the cells is lost due
to the background traffic fluctuations.

Figure 3 Aggregate average goodput and efficiency for XTP connections in the 1000 km bottleneck
topology, as a function of the Poisson background traffic load, the maximum window W
and MPS.

4.2 XTP connections with Poisson background traffic 

In Fig. 3 we consider a scenario quite similar to the one that was just discussed, but now XTP 
is the transport protocol. Results refer to an MPS equal to 9140 bytes in the upper row, and 1142 bytes
in the lower row. Both the Go-Back-N (left column) and the Selective Retransmission (right
column) schemes are studied. Each plot shows curves for both the aggregate average goodput 
and efficiency as functions of the background traffic load, in the cases of shaped (round marker) 
and unshaped (square marker) XTP traffic. 

First of all, we observe that shaping does not significantly modify the performance of XTP 
connections. This result is not surprising, since the main characteristic of XTP is that it is rate- 
controlled at the PDU level; it is thus reasonable that a further rate control at the cell level has 
a minor impact on performance. 

When the background traffic is low, XTP connections succeed in exploiting most of the available
capacity (indicated by the dot-dashed line) without any PDU loss (the efficiency equals 1).
As the background load grows, both the goodput and the efficiency of XTP connections drop
sharply, due to the fact that the only XTP mechanism to react to congestion, i.e. the synchronizing
handshake, cannot work properly with the given values of RTIMER and RTT. Indeed, XTP
sources transmit regularly-spaced PDUs at the average rate of 50 Mbit/s; when a synchronizing
handshake starts, the transmitter stops sending data PDUs, thus reducing the load on the
network. In our scenario, the RTT is very large compared to RTIMER, i.e. the time between two
consecutive ACK requests. Thus, each ACK request causes WTIMER to be overwritten, and the
synchronizing handshake not to be triggered. In other words, if the output buffer is at least as
large as the connection pipe and the network is not congested (no PDU is lost), the transmitter
never stops sending data because as soon as ACKs arrive the corresponding room in the output
buffer is freed, so that such buffer does not fill up.

In our simulations, the output buffers of XTP sources are slightly smaller than the connection 
pipe, considering empty buffers in the switches; thus, each XTP source transmits a full output 
buffer of data, and then it stops and waits for the ACKs; after a time period shorter than WTIMER
has elapsed, ACKs start reaching the transmitter.

With zero background traffic, the XTP sources can send the amount of data equivalent to a 
full output buffer without congesting the network: the efficiency is 1, but since some time is
wasted waiting for ACKs, the bandwidth used is smaller than the raw channel data rate. 

When the background traffic is low, each XTP source transmits a full output buffer of data 
without overflowing the buffer of the first switch (efficiency is one and no PDUs are lost), but 
RTT increases, and so does the time spent waiting for ACKs. The increase in the amount of 
time spent waiting for ACKs is beneficial, since it is reflected in a decrease of the average load 
imposed on the network, thus avoiding congestion. Moreover, it is unlikely that no ACK reaches 
the transmitter before the WTIMER expires, and thus synchronizing handshakes can be avoided. 

As the background traffic increases, the buffer in the ATM switch cannot store a full output 
buffer of data from each XTP source, and some PDUs are lost. As soon as one PDU is received 
out of order, the receiver sends back a NAK identifying lost data. The transmitter reacts to the 
NAK by immediately retransmitting lost data; this further loads the network, especially when 
XTP GBN is considered. The more congested the network, the larger the amount of lost
data; thus the receiver does not receive PDUs and cannot send back ACKs or NAKs. When the
sender buffer is full, it stops until WTIMER eventually expires and the transmitter at last stops
overloading the network for an RTT (needed to complete the synchronizing handshake). Then,
the same sequence takes place again; this phenomenon generates the sharp drop of goodput and
efficiency that appears in the left column plots of Fig. 3. When XTP SR is analyzed, the decrease
of goodput and efficiency is smoother, as shown by the plots in the right column of Fig. 3.

Figure 4 Aggregate average goodput and efficiency for XTP and TCP connections in the
1000 km bottleneck topology, as a function of the fast ON-OFF background traffic load, the
maximum window W and MPS.

Thus, the low performance of XTP is mainly due to the inadequacy of the XTP mechanisms 
to cope with congestion: XTP is not able to promptly react to congestion because it slows down 
only after having sent a full output buffer of data; this stems from having RTIMER always expiring 
sooner than WTIMER, and the latter being overwritten. 

4.3 Comparison between TCP and XTP with Poisson background 
traffic 

The difference between the behaviour of the TCP and XTP connections is not striking in this
scenario. The TCP performance is significantly influenced by the choice of the window size;
when an almost optimal window is used, the goodput obtained by TCP is slightly better than
the corresponding figures for XTP GBN. The efficiency of TCP connections is always much
better than that of XTP, due to the adaptive congestion control algorithm. Increasing the
MPS significantly worsens the efficiency for TCP, while the goodput behaviour is not dramatically
influenced; only a slight increase for some values of background load can be noticed for bigger 
MPS. Increasing the MPS has a beneficial effect for the XTP goodput. As expected, XTP SR 
behaves better than both TCP and XTP GBN. Finally, shaping at the ATM layer has a marginal 
effect on both protocols. 

Simulations were run also in the case of a 10 km network. We do not discuss the results in 
this paper, but we outline here the main conclusions that can be drawn from them. 

The first observation is that, when the network is very short, in order to obtain good performance,
both TCP and XTP must behave similarly to a stop-and-wait protocol, i.e. a very small
window is required to obtain a significant goodput with acceptable efficiency. XTP connections
can obtain a very high goodput and efficiency thanks to a particular mechanism of synchronization
between the RTIMER and WTIMER due to the very short network span. The phenomenon is
the following: WTIMER expires much earlier than RTIMER, hence if a frame is lost or delayed too
much, a synchronizing handshake starts and finishes before (or soon after) RTIMER expires, so
that the transmission is basically not affected by the frame loss. This peculiar behaviour is very
critical and can be easily disrupted, so that we can conclude that also in this scenario the two
protocols behave quite similarly.

Figure 5 Aggregate average goodput and efficiency for XTP and TCP connections in the
1000 km bottleneck topology, as a function of the slow ON-OFF background traffic load, the
maximum window W and MPS.

4.4 TCP and XTP connections with ON-OFF background traffic 

We concentrate now on the bottleneck topology with either 3 TCP or 3 XTP connections sharing 
the bottleneck link with ON-OFF background traffic. In Fig. 4 (5) we present results when the 
background traffic is generated by a fast (slow) ON-OFF source. Goodput and efficiency averaged 
over the three connections are reported for TCP, XTP GBN and XTP SR, using round, square 
and triangle markers, respectively. 

We first observe that, as expected, XTP SR behaves slightly better than both TCP and XTP 
GBN, even if, when MPS=9140, the difference is minimal. 

TCP always exhibits better efficiency than XTP, and comparable goodput performance, except 
when the background load is very high with fast ON-OFF background; in this case, the slow 
increase control algorithm penalizes TCP connections with respect to XTP. 

When comparing the slow and fast ON-OFF scenarios, it can be observed that the slow ON-OFF
scenario always allows connections to obtain better performance, since the connections
have longer periods in which the background traffic is regular. When the background is switched
on and off more rapidly, the transport protocol control mechanisms must cope more frequently
with the variable network load; during these transient phases, connections are not able to fully
exploit the available bandwidth and the obtained goodput is reduced. TCP benefits from this effect
at high load, when the slow ON-OFF background generators are considered, since its congestion
control algorithm has enough time to increase the transmission window to a value that allows
connections to exploit the available bandwidth; performance figures of TCP in this case are
almost identical to those of XTP.

In general, when the ON-OFF background traffic is considered, it can be observed that the 
goodput is higher than in the Poisson background traffic case, since during the OFF period 
connections are able to grab the full link bandwidth; XTP connection efficiency also benefits from
this phenomenon. 

Also in this scenario the differences between TCP and XTP are not striking, and TCP still
provides better efficiency; this implies that its more controlled behaviour has a beneficial impact 
on the background traffic. 

4.5 Variable length connections 

We consider here either 3 XTP connections or 3 TCP connections with variable lengths sharing 
the bottleneck link with Poisson background traffic; connection lengths are 10, 100 and 1000 km
(square, round and diamond markers), and the MPS is set to 1142 bytes. 

We first discuss the results reported in Fig. 6, referring to 3 XTP connections; left pictures are 
obtained without shaping at the ATM layer, while a 50 Mbit/s shaping is enforced in the pictures
in the right column. The upper row refers to XTP GBN, the lower row to XTP SR. The behavior 
of the curves in the left plots is rather puzzling, with the shortest connection performing very 
poorly at low loads. 

This behavior is due to the RTIMER and WTIMER synchronization effect observed in very short 
networks, that in this case slows down the short connection, allowing the longer two to grab more 
bandwidth. However, if shaping is applied to the connections, the particular synchronization 
between RTIMER and WTIMER is destroyed, and the performance is much more predictable, with 
the longest connection performing poorly due to long waits and more retransmissions. 

The bias against longer connections is roughly linear with the RTT estimation, so that the 
difference between the 100 km and the 1000 km connections is much more evident than the one 
between the 10 km and the 100 km connections. As always XTP SR performs better than XTP 
GBN. 

Fig. 7 refers to TCP connections. In the top left plot we report results for unshaped TCP 
connections; all the connections have the same maximum window size equal to 48 segments. We 
can clearly observe the well known bias of TCP against longer connections, due to 
their higher RTT that forces longer connections to increase their window size at a slower speed. 
Increasing the background traffic reduces this effect, since queueing delays tend to reduce the 
differences among the RTTs experienced by the connections. In the top right corner this bias 
is smoothed by applying shaping at the ATM layer, mainly because now all the connections 
offer on average 50 Mbit/s. Also, shaping requires buffering at the UNI so that the 10 km and 
100 km connections experience a similar round trip delay. 

Figure 6 Aggregate average goodput and efficiency for XTP variable length connections, as a 
function of the Poisson background traffic load. 

In the remaining four plots in Fig. 7 we try to regulate the maximum window size W so that 
all connections come close to offering a nominal load of 50 Mbit/s. We first set, in the center left 
picture, the maximum window size to 2 segments for the 10 km connection, to 5 segments for 
the 100 km connection and to 48 segments for the 1000 km connection. These window sizes 
limit the capability of shorter connections to grab the available bandwidth; as a result, the 
longer connection obtains a higher throughput, and the situation is reversed with respect to the 
previous case. The center right frame refers to the same window sizes, but shaping is applied; 
again, shaping helps in obtaining better fairness. 
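
The relation between connection length and the window needed to offer a given load can be sketched as a bandwidth-delay product. The snippet below is only an illustration: it assumes a propagation delay of about 5 microseconds per km and ignores queueing, shaping and protocol overheads, so it need not reproduce exactly the window values (2, 5 and 48 segments) used in the simulations. The function name is hypothetical.

```python
import math

PROPAGATION_S_PER_KM = 5e-6   # assumption: roughly 5 us/km in fibre
MPS_BYTES = 1142              # Maximum PDU Size used in this scenario
TARGET_RATE_BPS = 50e6        # nominal offered load per connection
MIN_WINDOW = 2                # the text notes W cannot go below two segments


def max_window_segments(length_km: float) -> int:
    """Window (in segments) that offers about TARGET_RATE_BPS over one RTT."""
    rtt_s = 2 * length_km * PROPAGATION_S_PER_KM      # propagation only
    segments = TARGET_RATE_BPS * rtt_s / (8 * MPS_BYTES)
    return max(MIN_WINDOW, math.floor(segments))


for km in (10, 100, 1000):
    print(km, "km ->", max_window_segments(km), "segments")
```

Under these assumptions the 10 km connection is already pinned at the two-segment minimum, which is consistent with the remark that window tuning alone cannot regulate very short connections.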

Figure 7 Aggregate average goodput and efficiency for TCP variable length connections, as a 
function of the Poisson background traffic load and the maximum window size W. Top row panel 
titles: MPS=1142, W=48, TCP unshaped (left); MPS=1142, W=48, TCP shaped at 50 Mbit/s (right). 

Finally, in the left and right bottom plots, we try to fine tune the window size of shorter 
connections, which were the limiting factor in the previous scenario. The results suggest that a 
careful window setting could help in obtaining fairness among connections not controlled at the 
ATM level, although this fine tuning of the window is probably very difficult to perform in real 
time, and, in any case, will be useful only for a limited range of network loads. Moreover, any 
window tuning is useful only for long connections, i.e. not in a LAN environment, in which we 
are limited by the fact that the window size should at least be equal to two segments; to further 
reduce the offered load we need to reduce the MPS, Maximum PDU Size. 

As a general remark, both TCP and XTP are naturally prone to unfairness when comparing 
performances obtained by connections with different lengths. Shaping at the ATM level could 
help in reducing this unfair behaviour. For TCP connections, a careful setting of the window size 
together with appropriate shaping at the ATM layer allows good fairness to be obtained. It 
is also significant that while the efficiency is reasonably high for TCP connections, it drastically 
drops for XTP connections as soon as the background load is increased. 



4.6 Mixing TCP and XTP connections 

In this last scenario, we simultaneously consider 3 XTP and 3 TCP connections in the 1000 km 
bottleneck topology with Poisson background traffic. In order to maintain the same network 
load, we reduced by half most of the connection parameters (like the window size, the XTP 
transmission rate and so forth) with respect to the previously examined scenario. As a result, 
the average aggregate goodput also ranges within roughly halved boundaries. 

We report in Fig. 8 the aggregate XTP (white markers) and TCP (black markers) efficiency 
and goodput, averaged over the three connections. Regardless of the MPS and of the chosen XTP 
version (GBN in the first row, SR in the second row), the main phenomenon that 
can be observed is that TCP connections obtain a much lower goodput than XTP connections. 
This is due to the fact that TCP tries to regulate its offered load via the adaptive window 
mechanism. As soon as TCP detects congestion in the network, it reduces the offered traffic by 
shrinking its transmission window. XTP connections can easily grab the bandwidth left by the 
TCP connections since their transmission speed is reduced only via the synchronizing handshake 
procedure. At the end of the handshake, XTP immediately resumes transmitting at its fixed rate, 
while TCP connections are forced to maintain their window small, since the network remains 
congested. TCP connections, which are trying to cooperate in order to relieve network congestion, 
obtain a low goodput and even a worse efficiency than XTP. 

It must be observed that if traffic control mechanisms at the ATM layer are adopted, this biased 
behaviour should be easier to control; but, if the transport layer traffic control mechanisms come 
into play, TCP obtains worse performance due to its embedded window control mechanism. 
This very simple example, however, shows that some very unpleasant phenomena could arise if 
different protocols are mixed together without proper control. 

Figure 8 Aggregate average goodput and efficiency for 3 XTP and 3 TCP connections sharing 
the bottleneck link with Poisson background traffic. 



5 CONCLUSIONS 

The performance of two transport protocols, TCP and XTP, when used to support file transfers 
over ATM networks, was investigated through simulation, using CLASS, a simulation tool for 
the study of ATM networks at the cell and burst levels. 

Several numerical results were derived for a number of different sets of system parameters. The 
most general indication that can be obtained from the numerical results is that the performance 
differences between the two protocols are not striking. This is quite a good point in favor of 
TCP, whose diffusion is today extraordinarily wide. It thus seems that the main motivation for 
a possible replacement of TCP in very high-speed networks should not be rooted in performance 
issues, but might rather be based on the need for simpler (hence less time consuming) algorithms 
at the transmitter and the receiver. 

Both protocols tend to behave unfairly when several sources with different RTTs share 
network resources. Shaping at the ATM layer reduces this phenomenon, but it is rather ineffective 
in the other considered scenarios. 

Finally, users running TCP in a UBR-like context similar to the one presented in this paper 
suffer lower performance when compared with users relying on XTP. The bias against TCP 
users is quite clearly visible in a UBR context, but the picture could be significantly modified 
considering more elaborate (ABR-like) traffic control techniques. 

Abstract 

This paper addresses the problem of transmitting digital video communications over B-ISDN 
and provides a solution based on a good knowledge of both video system design and broadband 
network capabilities. We propose a picture-oriented QoS control framework to ensure a high 
end-to-end Quality of Picture (QoP) level for MPEG-encoded Video-On-Demand (VOD) applications 
carried over ATM networks. Our approach overcomes the difficulty imposed by random cell 
discarding due to the bursty and Variable Bit Rate transmission nature of compressed video, by 
using an accurate cell discrimination strategy and destination data recovery mechanisms. It is 
assumed that there are four hierarchical types of cell flows according to MPEG coding. Each flow 
has a different impact on the destination picture recovery process. Therefore, this scheme aims to 
minimize cell loss probability for critical data and thus, reduce the destination picture quality 
degradation, by adjusting the priority level of cells according to the carried data type. Despite 
these preventive measures, cell loss or errors may still occur. Therefore, complementary actions 
are cooperatively taken at the source and destination equipment (i.e., codecs), in conformity with 
the temporal requirements of MPEG video streams. On the one hand, a fast cell-error detection and 
recovery algorithm is proposed for protecting critical picture system information. On the other 
hand, a block-based bit interleaving scheme is designed to spatially disperse the effect of cell loss on 
reference frames. Finally, an extended AAL type 5 is presented to support such a high 
performance picture delivery framework. 

1 INTRODUCTION 

Asynchronous Transfer Mode (ATM) technology was initially designed to permit the 
simultaneous carrying of various services, all supported equivalently without any knowledge of 
the cell content, i.e. without discriminating against some cells in favor of others. Among all the services, 
multimedia applications and especially digital video communications should contribute as a major 
fraction of the overall traffic. These sources are inherently variable, asynchronous, stochastic and 
non-stationary. 

Today, it is accepted that residential audio-visual applications, such as TV broadcasting 
and Video-On-Demand (VOD), will work on the basis of high-speed ATM multiplexing and 
switching techniques and will necessarily be compressed to optimize network resource utilization 
(i.e., bandwidth, buffers). Therefore, several coding schemes have been candidates to be the 
Broadband-ISDN multimedia coding standard. These video compression algorithms are mostly 
based on the efficient Discrete Cosine Transform (DCT) and block-oriented techniques, but they are 
today seriously threatened by new families of coding techniques such as region-based coding, 
frequency decomposition or fractal transforms. Among the former family, three world-wide 
standards were considered for use within ATM networks: ITU H.261, ISO JPEG (Joint Photographic 
Experts Group) and ISO MPEG-1 (Moving Picture Experts Group). Unfortunately, they have all 
been designed to address the needs of specific services and applications, and they cannot 
interoperate or run without restriction over ATM-based Broadband networks. 

For instance, the H.261 scheme was initially designed to operate over specific digital 
networks (e.g. Narrowband-ISDN) at low bit rates of about n x 64 kbit/s to carry teleconferencing 
applications where motion is naturally more limited. Similarly, the JPEG algorithm only addresses 
still images and does not apply any inter-coding algorithm to take into account the temporal redundancy 
of moving picture sequences. Finally, processing delay was not a major concern for the first MPEG 
phase (MPEG-1), as it was not expected to be used in conversational services but was adapted for storage 
and retrieval applications of moving pictures from CDs in a loss-less environment [1]. 

Consequently, both ITU SG15 and ISO MPEG are collaborating to produce new higher 
quality video coding standards for B-ISDN, which are named ITU H.262 and ISO MPEG-2 
respectively [2]. To accelerate the availability of native ATM multimedia services, the ATM Forum 
SAA working group has proposed the mapping of MPEG-2 transport packets into AAL5 PDUs [3], 
and is also currently addressing the specification of a new dedicated AAL6 for very low bit rate 
MPEG-4 coded applications. MPEG-4, scheduled for 1998, is designed for a whole spectrum of 
very low bit rate applications. These video applications will have sampling dimensions up to 
176 x 144 x 10 Hz and coded bit rates between 4800 and 64,000 bit/s (e.g. video-phony, mobile 
audio-visual communication, or multimedia electronic mail). 

However, the use of such compression algorithms will change the flow transmission 
characteristic, from CBR to VBR. This will inherently cause a higher cell loss probability, a 
greater cell delay variation and will increase network management complexity. Since cell loss is a 
major problem for compressed data streams, defensive measures have to be taken at the edge 
(i.e., codecs) and inside the network (i.e., switches) to maintain a minimum display picture quality. 
Despite this demand, few QoS control schemes have been proposed to take into account the 
specific characteristics of bursty MPEG-encoded video streams. Most of today’s traffic control 
research emphasizes ABR and UBR "Best effort" services, as defined in ATM Forum 
specifications [4]. 




Picture quality control for MPEG video over ATM 






Therefore, we propose a video-oriented QoS control framework for bursty video services, 
which is based on the layered data structure of MPEG streams. The framework allows the design 
of a multi-level priority policy coupled with a fast data-loss recovery mechanism targeted at both 
system control and picture information. This approach allows the network control functions 
(Usage Parameter Control, Priority Control, Resource Management, ...) to react selectively in 
case of congestion by preserving critical data from discarding, and performing picture 
concealment at the destination terminal in case of missing data. 

This paper is organized as follows. In section 2, we briefly review the main factors 
affecting the Quality of Picture of compressed video applications. The third section is devoted to 
our multi-level priority strategy and the description of the different components of the QoS 
control framework to guarantee a low picture quality degradation in both situations of data loss 
and bit-errors. A particular focus is made on network internal policies (i.e., multi-level cell 
priority and intelligent discarding functions) as well as edge proactive and reactive actions (resp. 
error recovery scheme and picture concealment through block interleaving). Finally, concluding 
remarks are given in section 4. 



2 FACTORS AFFECTING PICTURE QUALITY 

From a network control standpoint, applications and underlying generated bit flows are commonly 
grouped in two service classes according to their traffic characteristics and sensitivity. The first 
class is related to error and loss-sensitive flows like traditional LAN traffic. The second class 
concerns delay sensitive applications, which are usually referred to as real-time services (voice, 
non compressed video, ...). 

With real-time encoded video applications, this artificial border between reliability and 
stringent temporal requirements is not respected. Indeed, video applications using compression 
capabilities are subject to both error-free and real-time transmission constraints. Consequently, 
controlling the quality of service (QoS) of such demanding applications is widely considered as 
one of the most important challenges to be faced before introducing Broadband-ISDN multimedia 
services on a widespread basis. 

It is thus necessary to precisely analyze the various networking factors interfering with 
audio and picture quality degradation. In the following, we emphasize data losses and error 
effects. 

2.1 Data losses due to cell errors 

Random bit errors occurring along the communication path or within the network nodes due to 
electrical or physical problems can severely damage the quality of the decoded pictures. At the cell 
level, when such bit errors occur in the header, the cell is either mis-delivered when errors and 
address modifications are undetected, or discarded by the physical layer or by the receiver in case 
of uncorrectable detected errors. In both cases the whole cell should be considered as lost and the 
consequences can be serious for the MPEG decoding process. If the error occurs in the payload 
of the cell, the damage is limited to the degradation of part of the MPEG packet. If 
this part belongs to the MPEG Transport Stream (TS) packet header, the entire packet may be lost 
and the impact on the displayed pictures can also be very serious. Fortunately, the probability of 
such data losses is normally extremely low. For instance, in high-speed networks based on optical 
fibers, it does not exceed 10^-(…). Nevertheless, even for these transient error events, new 
mechanisms (i.e., error detection and correction schemes) are required at the MPEG level to 
ensure a low video quality degradation. 

2.2 Data losses due to burstiness and excessive delays 

With regard to real-time video service, variable bit rate (VBR) transmission has several 
advantages over conventional constant bit rate (CBR) mode. Among the advantages are: 
consistent and subjectively better video quality, simpler encoder design and increased 
multiplexing gain of a factor of two and even four with cell loss probability of 10^-(…) [5][6]. 
However, VBR transmission mode is an important cause of data losses due to peaks in traffic and 
subsequent switch buffer overflow. These heavy loads are mostly due to inadequate network 
resource allocation and multiplexing processes. Exceeding network capacity leads to cell 
discarding by either the congested nodes (through UPC) or the destination terminal if the delay 
exceeds a threshold. In the latter case, the MPEG packet arrives too late for playback on the 
terminal. Both cases lead to a loss of data in units of MPEG Transport packets, which is 
unacceptable. Preventive actions must then be applied to minimize the QoS degradation, inside 
the network through intelligent discarding and, at terminal nodes, through fast data recovery 
schemes at both ATM and MPEG transport levels. 



3 A PICTURE QUALITY CONTROL FRAMEWORK 

The MPEG scheme handles three frame types which are distinguished by the compression 
method used: Intra-coded (I frames), Predictive coded (P frames) and Bi-directional predictive coded 
(B frames). I frames are encoded using only a spatial DCT compression and do not refer to any 
other frames. P frames use DCT and a simple Motion Compensation (MC) with reference to 
the previous I or P frame. Finally, B frames use DCT and MC Interpolation with respect to the 
previous I or P frame and the next I or P frame. In order to increase coding efficiency, a typical 
arrangement is composed of a periodic interleaving of frames from each mode. This deterministic 
periodic sequence is called a Group Of Pictures (GOP) and is depicted in Figure 1. 
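
As an illustration of this periodic arrangement, the frame types of one GOP can be generated from two parameters: N, the number of frames in the GOP, and M, the spacing between anchor (I or P) frames. This is a minimal sketch in display order; the function name is hypothetical, and the values N=12 and M=3 are the common pattern quoted later in the paper.

```python
def gop_pattern(n: int, m: int) -> str:
    """Frame types of one GOP in display order.

    n: number of frames in the GOP
    m: distance between consecutive anchor (I or P) frames
    """
    frames = []
    for i in range(n):
        if i == 0:
            frames.append("I")      # each GOP starts with an intra-coded frame
        elif i % m == 0:
            frames.append("P")      # every m-th frame is a predicted anchor
        else:
            frames.append("B")      # the frames in between are bi-directional
    return "".join(frames)


pattern = gop_pattern(12, 3)
print(pattern)                                   # IBBPBBPBBPBB
print({t: pattern.count(t) for t in "IPB"})      # {'I': 1, 'P': 3, 'B': 8}
```

Although B frames dominate in number, they are the most heavily compressed, which is why their share of the bit stream is much smaller than their share of the frame count.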

Figure 1 - Group Of Pictures (GOP) 

Figure 2 - Impact of coding modes on frame size. 

Given their importance for the decoding process and thus for the displayed picture quality, I 
frames have to be sent with a guaranteed delivery service. Similarly, clock references and system 
data, located at the TS and Packetized Elementary Stream (PES) headers, have to be protected 
from loss or errors to allow the destination system controller to segregate and synchronize the 
audio and video sub-flows. Assuming this priority hierarchy between MPEG data, it is our goal to 
propose a video-oriented approach for traffic management and congestion control operations to 
better take into account the delay and loss sensitive characteristics of MPEG-encoded 
applications. Therefore, dedicated proactive and reactive QoP control policies are described in the 
following sections. 

3.1 Proactive Network Control policies 

3.1.1 Protection by shaping and policing targeted for encoded video streams 

The call admission control (CAC) function is not sufficient to prevent cell discarding and 
upper-layer MPEG packet loss. Indeed, users may not respect the connection parameters negotiated at 
the call set-up (Average Cell Rate, Peak Cell Rate, Sustainable Cell Rate, Maximum Burst Size, 
...). Network operators must then ensure that the sources stay within their connection parameters. 
These functions, referred to as traffic shaping and traffic policing, should detect the generation of 
non-conforming cells and take appropriate actions to minimize the negative effect of traffic excess. 
Several actions are proposed in the ITU-T I.371 recommendation [7]: dropping, delaying, or marking 
the non-conforming cell and/or notifying the violating source to reduce its bit rate (Explicit 
Backward Congestion Notification). 

Due to the bursty characteristics of encoded video streams, the buffering of cells before 
they enter the network (i.e., shaping) will significantly reduce bandwidth utilization but may 
extend delays to unacceptable limits. For instance, at the start of a Video-on-Demand session, a 
fast load of the first few minutes of a movie is performed from the selected Video Server. This 
action requires a large bandwidth, e.g., 100-150 Mb/s, to be allocated in the network for a few 
seconds. This provides the user with quasi-immediate access to the service, but at the same time 
the video signal rate of the remaining parts of the movie decreases to several Mb/s (about 2-8 
Mb/s). Similarly, random cell discarding (i.e., policing) can cause serious degradation of the 
picture quality. 

It is our goal to provide a video-oriented approach for traffic management and congestion 
control operations to better take into account the delay and loss sensitive characteristics of 
MPEG-encoded applications. 

3.1.2 A video-oriented priority scheme and an associated discarding policy 

As discussed previously, indiscriminate cell dropping leads to visible damage in the displayed 
pictures. This can be overcome by the use of an effective four-state priority scheme associated with an 
intelligent discarding strategy and based on the MPEG multi-level data structure. 

To avoid worsening congestion or excessive jitter, a lower priority cell, in the MPEG 
coding sense, is dropped (rather than delayed) and its buffer space is given to a higher priority 
cell. For our purpose, we focus on a shared buffer approach. The results in [8] indicate that 
performance is increased, but at the cost of more complex ATM switch management. 

A multi-level priority strategy, named Early Packet Discarding (EPD), has been 
previously proposed for discarding a set of cells. For packet-oriented services, when an 
intermediate node discards a cell, all the following cells associated with the same packet no 
longer need to be transmitted and can be removed. In our case, we adapt such a capability to 
discriminate the cells belonging to the different MPEG frame types (I, P, B, System control) 
which are candidates for discarding. During congestion, groups of B-cells between either P or 
I-cells are dropped first. This will statistically reduce the occupancy of overloaded buffers by 
about 13%. This allows network control and management functions to ensure a graceful picture 
quality degradation for MPEG connections. We should note that the EPD mechanism, in relation 
with AAL5 protocols, uses a marked bit belonging to the PTI field to determine the ends of data 
packets. 
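
The packet-level discarding idea can be sketched as a small filter: once one cell of a packet has been dropped, the remaining cells of that packet are dropped as well, and forwarding resumes after the end-of-packet mark. This is only an illustration under simplifying assumptions: the congestion test is replaced by a set of cell indices (congested_at), the names are hypothetical, and the end-of-packet cell of a damaged packet is dropped too, whereas a real switch may prefer to forward it to preserve packet delineation.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[str, bool]  # (label, end_of_packet flag carried in the PTI field)


def discard_packet_tails(cells: Iterable[Cell], congested_at: Set[int]) -> List[Cell]:
    """Forward cells, dropping the rest of a packet once one of its cells is dropped.

    congested_at holds the positions at which the node is forced to drop a cell;
    it stands in for a real buffer-occupancy test.
    """
    forwarded: List[Cell] = []
    dropping = False
    for index, (label, end_of_packet) in enumerate(cells):
        if index in congested_at:
            dropping = True                 # first loss inside this packet
        if not dropping:
            forwarded.append((label, end_of_packet))
        if end_of_packet:
            dropping = False                # the next cell starts a new packet
    return forwarded


stream = [("a1", False), ("a2", False), ("a3", True), ("b1", False), ("b2", True)]
print(discard_packet_tails(stream, congested_at={1}))
# [('a1', False), ('b1', False), ('b2', True)]
```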

To implement priority management capabilities, ITU-T has proposed the use of the one-bit 
CLP mechanism to distinguish between two types of cells within a VC stream. Actually, three 
types of priority cells can be defined according to the value of the CLP bit : CLP=0 for high 
priority cells, CLP=1 for low priority cells and CLP=0+1 for tagged cells. 

Based on this analysis, we are able to define two types of priority levels in ATM networks: 
a "service-oriented" priority and a "congestion-oriented" one. The former reflects the relative 
importance of one cell over another (user point of view), while the latter is generally assumed by 
the UPC/NPC mechanisms (traffic control point of view). According to the ATM cell structure, a 
conflict arises when the CLP bit is used to implement both these levels of priority assignment. 
Consequently, there is no possible distinction between the service-oriented (vital vs. non-vital) 
and congestion-oriented (conforming vs. non-conforming) markings of cells [9]. 

Table 1 - ExCLP and MPEG data mapping 



To overcome this difficulty, our control scheme applies a four-state priority strategy using 
two bits. The first one is the classical CLP bit and the second one is the adjacent bit belonging to 
the PTI field (Figure 3). This new field, referenced as Extended CLP (ExCLP), allows each CLP 
and PTI bit to singly assume their original meanings according to standards [10]. Four priority 
classes are defined and MPEG data types are mapped as shown in Table 1. 
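
The two-bit field can be illustrated on the fourth octet of the ATM cell header, where the CLP bit is the least significant bit and the PTI field occupies the three bits above it. The sketch below is an assumption-laden illustration, not the authors' encoding: only the code '00' for system data is stated in the text, the codes chosen for I, P and B data are hypothetical, and the bit order inside ExCLP (PTI bit as the high-order bit) is also assumed.

```python
EXCLP_MASK = 0b11  # CLP (bit 0) plus the adjacent PTI bit (bit 1) of header octet 4

# '00' for system data comes from the text; the three other codes are
# assumptions made for this example only.
EXCLP_CODE = {"SYS": 0b00, "I": 0b01, "P": 0b10, "B": 0b11}


def set_exclp(octet4: int, data_class: str) -> int:
    """Return header octet 4 with its two low-order bits set to the class code."""
    return (octet4 & ~EXCLP_MASK & 0xFF) | EXCLP_CODE[data_class]


def get_exclp(octet4: int) -> int:
    """Read the two-bit ExCLP value back from header octet 4."""
    return octet4 & EXCLP_MASK


octet = set_exclp(0b10100000, "B")
print(f"{octet:08b}", get_exclp(octet))   # 10100011 3
```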

With respect to the MPEG compression algorithm, the System data class, which groups 
audio data and the following MPEG headers (Transport Stream, PES, Sequence, GOP, Frame, 
Slice and Audio), is considered as critical information and is assigned the highest priority level. 
These data mostly convey audio-video synchronization data and coding control information related to 
each picture level. The loss of such data will, in the best case, generate audio or picture silences, 
and in the worst case, the loss of the whole MPEG TS packet. For instance, the slice start code 
field located at the slice header indicates the position of each slice in the whole picture, and 
obviously has to be preserved from loss. The cells carrying I-frame data should also have a high 
priority level, since they embed intra picture information and are referenced by other frames. The 
two other types of frames (P and B) are assigned lower priority levels. 

When network congestion occurs, B-cells are discarded first, while P, I and 
system data cells are preserved from elimination during network crossing. To quickly reduce 
buffer occupancy while keeping QoS degradation low, it is suitable to perform an Early Packet 
Discarding scheme on successive groups of B-cells. Statistically, B-cells represent about 13 % of 
the whole stream for a common GOP pattern (N=12 and M=3). If buffer overflow persists, the 
node may further discard P-cells (29%) and I-cells (53%). These measures will ensure a graceful 
and manageable QoS degradation, by means of a selective cell discarding strategy and 
cooperative data recovery techniques at the source and destination equipment. 
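
The discarding order described above (B-cells first, then P-cells, then I-cells, with system data preserved as long as possible) can be sketched as a shared buffer with push-out: when the buffer is full, an arriving cell replaces the least important stored cell, provided the arriving cell is more important. This is a minimal sketch with hypothetical names; the numeric priority values only encode the ordering given in the text.

```python
from typing import List, Optional

# Smaller number = more important. The ordering SYS > I > P > B follows the
# text; the numeric values themselves are illustrative.
PRIORITY = {"SYS": 0, "I": 1, "P": 2, "B": 3}


class PushOutBuffer:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.cells: List[str] = []          # cell classes in arrival order

    def enqueue(self, cell: str) -> Optional[str]:
        """Store a cell; return the class of the cell that was dropped, or None."""
        if len(self.cells) < self.capacity:
            self.cells.append(cell)
            return None
        # Buffer full: pick the least important stored cell
        # (the most recently arrived one among equals).
        victim = max(range(len(self.cells)),
                     key=lambda i: (PRIORITY[self.cells[i]], i))
        if PRIORITY[self.cells[victim]] > PRIORITY[cell]:
            dropped = self.cells.pop(victim)
            self.cells.append(cell)
            return dropped
        return cell                          # arriving cell is not more important


buf = PushOutBuffer(capacity=3)
drops = [buf.enqueue(c) for c in ["B", "P", "B", "I", "SYS", "B"]]
print(drops)       # [None, None, None, 'B', 'B', 'B']
print(buf.cells)   # ['P', 'I', 'SYS']
```

In this run only B-cells are lost; P-cells would be pushed out only once no B-cell remains in the buffer.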

3.1.3 A video-oriented ATM Adaptation Layer for QoS control support 

In order to realize the previous multi-level priority management of MPEG-2 bit-streams, 
appropriate adaptation protocols are necessary. The ATM Forum recommends the carrying of two 
MPEG-2 Transport Stream packets per AAL5-PDU with a NULL Convergence Sub-layer [11] 
(Figure 4a). This way, the Quality of Service of bursty interactive VOD applications is not 
guaranteed. This topic is definitely still an open issue. 

Real-time video services require a more sophisticated AAL than the simple AAL type 5 
standard. Therefore, we propose an extended AAL5, namely AAL5+, which provides further 
picture quality control facilities. Among the additional features offered by AAL5+, there are a 
new specific CS layer and the ability to perform two different segmentation and packetization 
processes on two types of CS-PDU. 

The aim of the specific CS sub-layer is to allow a first-step MPEG data extraction process. 
The AAL5+ SSCS protocol is composed of a one-bit header and a fixed payload. Depending on 
the value of the header bit, the SSCS-PDU carries MPEG-2 TS and PES headers (0) or only PES 
payload types (1). 




Figure 4a - Classical mapping of MPEG-2 Transport Stream packets over AAL5 

Figure 4b - New mapping of MPEG Transport Stream packet over AAL5+ 

One bit from the AAL5 CPCS-user-to-user indication field is newly used to discriminate 
between the SSCS-PDUs received from the upper layers. The remaining seven bits of the 
CPCS-UU field are used as a sequence number (SN) for our cell-loss detection and recovery 
scheme (see section 3.2.1). This is a light modification of the previous AAL5 standard which 
allows full protocol interoperability. Moreover, the AAL5 2-byte control field is not yet used 
or specified by standards bodies. This modification allows the definition of two types of CS-PDU, 
namely System-oriented CS-PDU and Data-oriented CS-PDU (see Figure 4b). 
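
The use of the CPCS-UU octet can be illustrated as follows: one bit identifies the CS-PDU type and seven bits carry a sequence number that wraps modulo 128, so that gaps between consecutively received numbers reveal losses. This is a sketch only; placing the type bit in the most significant position is an assumption, and the function names are hypothetical.

```python
SN_MODULUS = 128  # seven-bit sequence number


def pack_uu(is_data_pdu: bool, seq: int) -> int:
    """Build the CPCS-UU octet: type bit (assumed to be the MSB) + 7-bit SN.

    The type bit is 0 for System-oriented and 1 for Data-oriented CS-PDUs.
    """
    return (int(is_data_pdu) << 7) | (seq % SN_MODULUS)


def unpack_uu(uu: int):
    """Return (is_data_pdu, sequence_number)."""
    return bool(uu >> 7), uu & 0x7F


def missing_between(prev_seq: int, seq: int) -> int:
    """Number of units lost between two consecutively received sequence numbers."""
    return (seq - prev_seq - 1) % SN_MODULUS


print(hex(pack_uu(True, 5)))        # 0x85
print(unpack_uu(0x85))              # (True, 5)
print(missing_between(126, 1))      # 2 (numbers 127 and 0 are missing)
```

With seven bits, a run of 128 or more consecutive losses cannot be distinguished from a shorter one, which bounds the burst length that this detection can handle.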

The underlying SAR protocol will subsequently segment each PDU into ATM cells and 
set the ExCLP field according to the CS-PDU type and the MPEG syntax. A segmentation process 
is defined for each CS-PDU type. 

System-oriented PDUs are simply segmented into cells with each ExCLP field set to ‘00’ 
(i.e., very high priority), whereas the Data-oriented CS-PDUs undergo a different 
segmentation and packetization process. Actually, a parsing scheme, with respect to MPEG 
syntax, is performed to build the following data cells. This stream parsing will extract audio data 
and MPEG headers from picture data. The headers and audio data are then transmitted in cells 
with the ExCLP field set to ‘00’, while for picture data, the ExCLP value depends on their coding 
mode. The ExCLP value of the current cell is assigned according to the picture coding type field 
located in the frame header. This picture coding type field specifies for each frame the 
coding mode used (Intra, Predictive or Bi-directional). 

3.2 Reactive destination policies 

3.2.1 System data protection through data-loss detection and recovery scheme 

For a cell with one corrupted bit, a Header Error Control (HEC) model is efficient, but in 
ATM-based broadband networks, corrupted cells are usually the result of a physical transmission 
problem. When errors occur, they typically take the form of long bursts. Therefore, too much data 
is damaged in the burst and the HEC scheme is not able to correct it. Our approach is to consider 
corrupted cells as missing cells and then perform a cell loss recovery algorithm on critical 
system data information. 

As depicted in Figure 5, cell loss can be corrected by structuring a set of AAL5+ PDUs as a 
square block and appending an AAL5+ PDU embedding parity-check cells computed on each 
column. At the destination system, this can be seen as a virtual (N+1)*(M+1) cell matrix, whereas 
at the source system, it is only an (N+1)*M cell matrix, with the (N+1)th row of cells containing 
parity bits for checking the corresponding bits in the associated column direction. Since the source 
systems are able to insert a sequence number with our AAL5+ layer (using the newly added 7 bits), 
cell loss can be detected and the position of the lost cell(s) easily located. The contents of the 
parity cell in the (N+1)th row can then be used to restore the lost cell(s). At this step, no more than 
one cell per column can be recovered. Thus, in case of serious congestion, we propose 
to process two additional parity-check cells in relation with each diagonal of the matrix. 
According to [12], this extended mode allows us to recover up to two lost cells per column. 
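
The first (column parity) mode can be sketched with a bytewise XOR: the parity cell of a column is the XOR of the N data cells in that column, so a single missing cell is the XOR of the survivors and the parity cell. This is a minimal sketch under assumptions: cells are modelled as equal-length byte strings, a lost cell is represented by None (its position being known from the sequence numbers), and the diagonal parity of the extended mode is not covered. All names are hypothetical.

```python
from functools import reduce
from typing import List, Optional

Cell = bytes


def xor_cells(cells: List[Cell]) -> Cell:
    """Bytewise XOR of equally sized cells."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def parity_row(matrix: List[List[Cell]]) -> List[Cell]:
    """One parity cell per column of an N x M matrix of data cells."""
    return [xor_cells(list(col)) for col in zip(*matrix)]


def recover(matrix: List[List[Optional[Cell]]], parity: List[Cell]) -> List[List[Cell]]:
    """Restore at most one missing cell (None) per column from the parity row."""
    repaired = [row[:] for row in matrix]
    for c, parity_cell in enumerate(parity):
        lost = [r for r, row in enumerate(repaired) if row[c] is None]
        if len(lost) > 1:
            raise ValueError(f"column {c}: more than one lost cell")
        if lost:
            survivors = [row[c] for row in repaired if row[c] is not None]
            repaired[lost[0]][c] = xor_cells(survivors + [parity_cell])
    return repaired


data = [[b"\x01\x02", b"\x10\x20"],
        [b"\x03\x04", b"\x30\x40"],
        [b"\x05\x06", b"\x50\x60"]]
parity = parity_row(data)
damaged = [[b"\x01\x02", b"\x10\x20"],
           [None,        b"\x30\x40"],
           [b"\x05\x06", None]]
print(recover(damaged, parity) == data)   # True
```

Because XOR is associative, the sender can accumulate the parity cells while the data cells are being transmitted, in line with the remark that no pre-transmission buffering is needed.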

For bandwidth saving and fast processing considerations, we prefer to use the first mode. But 
if necessary, the second mode can be activated on reception of a notification message from the 
network (switches or destination equipment), or from a higher management entity. This scheme 
also has a useful feature in that it allows the parity-check cells to be computed while data is being 
transmitted. This way, there is no need to buffer cells before transmission. But if a cell is lost and 
detected by the AAL5 CRC, the receiver will have to buffer the whole set of cells for the recovery 
process. 

Figure 5 - Enhanced cell loss detection and recovery scheme. 



Due to the GOP pattern (N=12, M=3), I-frames are generated every 12 frames and account for
approximately 50% of the total amount of data. P and B frames represent 30% and 15%
respectively. To reduce control overhead and processing time, this data recovery process is only
performed on System data cells (about 5%). If missing or erroneous I-cells are detected, a
block-based bit interleaving scheme is applied at the destination to reduce the effect of errors
on local picture sections.

3.2.2 I-frames protection by Block Interleaving 

In the MPEG video coding scheme, adjacent data are highly correlated. This high degree of 
correlation can be used to hide any missing picture block by using data from adjacent or closest 
possible blocks. This technique is only effective if adjacent coded data are interleaved in such a 
way that encoded data for neighboring block regions are placed in separated transport data units 
(i.e., ATM cells). 

We use such a strategy at the SAR protocol level to protect referenced Intra-coded frames 
from random block losses. The cell segmentation and packetizing would then proceed not 
according to a traditional scanning order, but in an interleaved order. This way the effect of an I- 
cell loss, at the receiver, will be spatially dispersed over a larger picture section and will enhance 
displayed picture quality. 

With the consumer High-Definition TV standard (1440x1152x30/s), each MPEG-2 block
section is composed of 8*8 picture elements (pixels) with chromatic and luminance resolutions of
24 bits (i.e., high color format). Each MPEG-2 block is then transported in exactly four (4) ATM
cells (8*8*24 = 1536 bits = 192 bytes = 4*48 bytes of ATM payload). Using this property, we
define an efficient bit interleaving process on an MPEG-2 block and ATM cell basis, illustrated
by Figure 7.

Figure 7 - Bit interleaving on block and ATM cell basis



4 CONCLUSION 

In this paper, we have surveyed a number of topics related to the coding and control of
audio-visual on-demand applications conveyed over ATM networks. Particular attention has been
given to the MPEG-2 compression scheme and the associated transport data structure, since it has
been adopted as the coding standard for deploying multimedia services over Broadband-ISDN. More
specifically, we have analyzed the network factors affecting the Quality of Service of
MPEG-encoded VOD applications and we have shown the need for dedicated video-oriented QoS
control policies for such a service.

In this perspective, the framework presented in this paper has been designed to allow a
refined picture quality control based on the MPEG-2 data structure and characteristics. The
framework implements, in a harmonized way, proactive and reactive control policies both at the
network edge (i.e., codecs) and within the network (i.e., switches). The proactive policies
protect highly sensitive data through a four-level priority assignment. The reactive policies are
performed at the destination equipment to recover from cell loss and errors. By introducing
semantic information at the transmission level (i.e., ATM cell level) concerning the type of
data carried by the cells, we have shown how the proposed framework, together with the enhanced
ATM Adaptation Layer type 5 that we defined, makes it possible to deal with network congestion
selectively and efficiently.

The ultimate aim of this work has been to ensure graceful end-to-end picture quality
degradation for current video applications over broadband networks. For demanding real-time
compressed video services, which are characterized by a high congestion probability, we believe
that our dedicated framework will bring better quality. The overhead introduced by the complexity
of the presented control policies will be offset by the processing power available in today's
multimedia end equipment and by off-line processing capability. For instance, Video-on-Demand
servers can more easily store all the programs as ATM cells with priority assignment, rather
than as plain bit streams [13]. Finally, we are now planning to evaluate our approach and the
implemented algorithms on an ATM testbed that is currently being deployed.



Abstract

We present and discuss detailed measurements showing how TCP behaves in an ATM LAN
where a Variable Bit Rate (VBR) traffic contract is enforced by a Usage Parameter Control
(UPC) mechanism. These measurements show that TCP has great difficulty adapting its
behaviour to such a traffic contract. In our environment, TCP only achieved a 10% utilisation
of a 10 Mbps VBR VC. We then discuss and evaluate possible solutions to improve this low
utilisation of the reserved bandwidth.



1 INTRODUCTION 

Asynchronous Transfer Mode (ATM), selected in 1988 by the ITU as the basis for the B-ISDN,
quickly attracted great interest within the data communications community, and it soon
appeared as a promising technology for the LAN environment. Many consider ATM's inherent
scalability, as well as the provision of QoS guarantees, as its main benefits compared with other
high-speed LAN technologies.

However, to quickly benefit from ATM with today’s applications and networking protocols, 
it was necessary to achieve a seamless integration of TCP/IP and other widely used networking 
protocols within the connection-oriented ATM environment. The connection-oriented nature of 
ATM does not fit well with most of the current networking protocols which have been 
developed with a connectionless environment in mind. Both the Internet Engineering Task 
Force (IETF) and the ATM Forum have proposed solutions to integrate TCP/IP (IETF) as well 
as other networking protocols (ATM Forum) within the ATM environment. These two 
solutions, respectively Classical IP over ATM and LAN Emulation, define mainly how the 
address resolution must be done from an IP or MAC layer address to an ATM address, and 
how and when ATM Virtual Circuits (VCs) are established between communicating entities. 

However, this address resolution problem, while an important, bootstrap-like problem, is not
the only one that must be solved for a seamless integration of TCP/IP in the ATM environment.
The pending issues notably include the suitability of TCP's congestion control scheme for the
cell-based ATM environment, how to perform routing efficiently in a mixed IP/ATM network, and
how to benefit from the QoS guarantees provided by ATM.

In this paper, we first review the three types of traffic contracts supported by today’s ATM 
standards. Then, we present and discuss detailed measurements showing how TCP is able to 
adapt its behaviour to a Variable Bit Rate (VBR) traffic contract, and look at some possible 
improvements. 



2 ATM TRAFFIC CONTRACTS 

When ATM was selected by the ITU, it was assumed that the users of the B-ISDN would be
able to define their bandwidth requirements precisely through a traffic contract or traffic
descriptor. The network would use this traffic contract to perform operations such as routing,
Connection Admission Control (CAC) and Usage Parameter Control (UPC). There were two
possible approaches for the traffic contract: a statistical one and an operational one [DeP93].
The first approach uses stochastic traffic parameters such as an average cell rate or an average
burst size. Such parameters can be used to perform CAC, but they would have been difficult to
use for UPC: long observation times are usually necessary to verify compliance with a
statistical traffic contract, while the UPC must react quickly to traffic contract violations to
protect the network from misbehaving users. The operational approach defines the traffic
contract with one algorithm and a few associated parameters. This approach is suitable for both
CAC and UPC as it allows an unambiguous discrimination between conforming and
non-conforming cells. The algorithm selected by the ITU is known as the Generic Cell Rate
Algorithm (GCRA) [I371].

The GCRA uses two parameters: the cell period $T$ and the allowed Cell Delay Variation
(CDV) $\tau$. $T$ is the reciprocal of a cell rate and, intuitively, $\tau$ corresponds to the
maximum allowed short-term deviation from this cell rate.
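
The conformance test performed by the GCRA can be written in a few lines. The sketch below uses the virtual scheduling formulation of the algorithm. The class name, the method name and the numeric values in the example are hypothetical, and times are expressed in an arbitrary unit.

```python
class GCRA:
    """Virtual scheduling form of the GCRA with cell period T and CDV tolerance tau."""

    def __init__(self, period, tolerance):
        self.period = period        # T
        self.tolerance = tolerance  # tau
        self.tat = 0.0              # theoretical arrival time of the next cell

    def conforms(self, arrival):
        """Return True (and update the state) if a cell arriving at `arrival` conforms."""
        if arrival < self.tat - self.tolerance:
            return False            # cell is too early; the state is left unchanged
        self.tat = max(arrival, self.tat) + self.period
        return True

upc = GCRA(period=10.0, tolerance=5.0)
print([upc.conforms(t) for t in (0.0, 6.0, 12.0, 18.0, 30.0)])
# [True, True, False, True, True]
```

The third cell is rejected because it arrives more than the tolerance ahead of its theoretical arrival time. A rejected cell does not advance the state, so it does not penalise the cells that follow.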

The current UNI signalling specification [UNI30] uses this GCRA to define three types of 
traffic contracts : 

- Constant Bit Rate (CBR) 

- Variable Bit Rate (VBR) 

- Best Effort (later renamed Unspecified Bit Rate [Sat96] ) 

The CBR traffic contract is suitable for applications which transmit at a constant bit rate
during the whole connection duration, e.g. leased line emulation or uncompressed voice or
video over ATM [Sat96], but nothing precludes its use by data applications. This traffic contract
is defined by GCRA[$T_{PCR}$, $\tau$], where $T_{PCR}$ is the reciprocal of the Peak Cell Rate
reserved for the CBR VC, and $\tau$ is the allowed CDV. The main characteristic of a CBR VC is
that the ATM network will reserve resources according to the Peak Cell Rate (PCR) of the
corresponding traffic contract and will guarantee a low (almost 0) cell loss ratio for this VC.
However, this traffic contract is seldom used for data applications, as these applications are
usually very bursty.

The second traffic contract is the Variable Bit Rate traffic contract. This VBR traffic contract
can be used for compressed audio and compressed video applications, but also for response
time critical transaction processing (e.g. airline reservation, banking transactions, ...) [Sat96].
It is defined by two GCRAs. The first one, GCRA[$T_{PCR}$, $\tau_{PCR}$], specifies the Peak
Cell Rate for the VBR VC. The second one, GCRA[$T_{SCR}$, $\tau_{SCR}$], specifies the
Sustainable Cell Rate (SCR) and the allowed CDV for this SCR. Intuitively, the SCR is the rate
at which the source may transmit indefinitely. The allowed CDV for the SCR, $\tau_{SCR}$, is not
specified directly by the end-system at connection establishment time. Instead, the end-system
specifies the Maximum Burst Size (MBS). Intuitively, the MBS corresponds to the largest burst of
ATM cells that can be sent at the Peak Cell Rate. $\tau_{SCR}$ can be calculated [UNI30] from the
MBS as follows:

$$\tau_{SCR} = (MBS - 1) \times (T_{SCR} - T_{PCR}) \qquad (1)$$
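
As a worked example of equation (1), the sketch below evaluates the CDV tolerance of the SCR for a few burst sizes. It uses the cell periods of the measurement environment described in section 3 (2.726 and 42.4 µsec per cell). The function name is hypothetical.

```python
T_PCR = 2.726   # usec per cell at the Peak Cell Rate
T_SCR = 42.4    # usec per cell at the Sustainable Cell Rate

def tau_scr(mbs, t_scr=T_SCR, t_pcr=T_PCR):
    """CDV tolerance of the SCR (usec) for a Maximum Burst Size of `mbs` cells, equation (1)."""
    return (mbs - 1) * (t_scr - t_pcr)

for mbs in (1000, 2000, 4000):
    print(mbs, "cells ->", round(tau_scr(mbs) / 1000, 1), "msec")
```

With these values, an MBS of 1000 cells corresponds to a tolerance of roughly 40 msec, and the tolerance grows linearly with the MBS.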

The main characteristic of a VBR VC is that the network will reserve resources for this VC
according to the specified PCR, SCR and MBS so that if the end-system respects the traffic
contract, the cell loss ratio for the VC will be low. The traffic contract will be enforced by a
UPC using the two GCRAs operating in parallel on the cell flow. This UPC will either discard
or tag (i.e. change the CLP bit from 0 to 1) the non-conforming cells, depending on which of the
three VBR traffic contracts (table 1) supported by ATM Forum UNI 3.0 is used for the VC.
These traffic contracts differ in their treatment of the low priority cells (CLP=1) and in
whether the non-conforming cells should be tagged or discarded by the UPC.



Table 1 : VBR Traffic contracts supported by ATM Forum UNI 3.0 specification

[Table body only partly recoverable. Surviving entries: for contract C, every traffic parameter
is specified for CLP=0+1 and the last column reads "not applicable"; the remaining rows contain
parameters specified for CLP=0+1 and for CLP=0, and one entry reading "requested".]



The best-effort contract is the last traffic contract introduced in ATM Forum UNI 3.0. It is
defined by GCRA[$T_{PCR}$, $\tau$]. However, in contrast with a CBR VC, the network does not
reserve any resources for a best-effort VC. Thus, there are no QoS commitments for a
best-effort VC.

The best-effort traffic contract is used by most ATM LANs today. Unfortunately, by using 
this traffic contract, these ATM LANs do not benefit from the QoS guarantees that can be 
offered by the ATM network. These QoS guarantees would surely be useful in environments 
where mission critical applications (e.g. airline reservation, banking transactions) are used and 
thus should get a guaranteed minimum amount of network resources, even when the network is 
heavily loaded. With ATM Forum UNI 3.0^1, the VBR traffic contract appears to be the best
solution to support such bursty applications as it allows reservation of resources, as well as a
certain amount of burstiness. With the third (C) VBR traffic contract (table 1), the applications
will always get the exact amount of resources specified in the traffic contract. With the first (A) 
VBR traffic contract, the applications will get the resources specified in the traffic contract when 
transmitting high priority (CLP=0) cells, but may get additional resources (on a best-effort 
basis) when they transmit low priority (CLP=1) cells. With the second (B) VBR traffic 
contract, the applications may get more than the amount of resources specified in the contract, 
either by transmitting low priority cells, or by sending high-priority cells in excess of the traffic 
contract and assuming that these cells (which will be tagged by the UPC) will be successfully 
transmitted if the network is lightly loaded. However, in a heavily loaded network, the VBR 
VCs will get exactly the resources specified in their traffic contract and enforced by the UPC. 
To benefit from the VBR VC, the higher layer protocols and the applications should be able to 
use this specified minimum amount of resources efficiently. In the remaining parts of this 
paper, we study how TCP is able to use such a minimum amount of resources. 



^1 Both the ITU and the ATM Forum are working on new traffic contracts (Available Bit Rate and
ATM Block Transfer), but it will take some time before they are supported by products and widely
deployed. Many ATM networks can be expected to continue to use the UNI 3.0/3.1 specifications
for some time.



3 MEASUREMENT ENVIRONMENT

Our measurement environment (figure 1) consisted of two Sparc 10 class workstations and one 
ASX-200BX ATM switch from Fore Systems. 

The two workstations were running SunOS 4.1.3_U1 and were connected to the ASX-200BX 
with 155 Mbps links. The sending workstation used a Fore SBA-200 ATM adapter, while the 
receiving workstation used a more recent Fore SBA-200E ATM adapter. They both used release 
4.0 of the ATM driver. 

The TCP implementation in SunOS 4.1.3_U1 is a subset of TCP Reno^2. It includes the
slow-start and congestion avoidance algorithms, but lacks the window extensions and 
timestamp options defined in [JBZ92]. The implementation of the fast retransmit algorithm 
[Ste94] is incomplete in SunOS 4.1.3_U1 [BKD96], but this bug had been fixed for our 
measurements. 



[Figure 1: Sender and Receiver workstations connected through the ASX-200BX. VBR traffic
contract: $T_{PCR}$ = 2.726 µsec/cell, $\tau_{PCR}$ = 0 µsec, $T_{SCR}$ = 42.4 µsec/cell,
$\tau_{SCR}$ = f(MBS)]

Figure 1 Measurement environment

The ASX-200BX is a 2.5 Gbps bus-based, non-blocking, output-buffered ATM switch. It
supports CBR, VBR and best-effort VCs, and contains dual leaky buckets used for the UPC.
For our measurements, we established one bi-directional PVC between the two workstations.
There was no other ATM traffic through the switch during our measurements, and thus there
was no congestion inside the switch. The traffic contract enforced by the UPC of the
ASX-200BX was set to a Peak Cell Rate of 155 Mbps (i.e. one cell every 2.726 µsec, which is
equal to the physical line rate), no specified allowed Cell Delay Variation for the Peak Cell
Rate, a Sustainable Cell Rate of 10 Mbps (i.e. one cell every 42.4 µsec), and we varied the
Maximum Burst Size during the measurements.
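
The two cell periods quoted above follow directly from the cell size. The sketch below assumes that "155 Mbps" denotes the 155.52 Mbps line rate and that all 53 bytes of a cell are counted. Both points are assumptions made for this illustration, and the function name is hypothetical.

```python
CELL_BITS = 53 * 8   # one ATM cell, header included

def cell_period_usec(rate_mbps):
    """Time between two consecutive cells at `rate_mbps`; bits / (Mbit/s) gives usec."""
    return CELL_BITS / rate_mbps

print(round(cell_period_usec(155.52), 3))   # 2.726
print(round(cell_period_usec(10.0), 1))     # 42.4
```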




Figure 2 An equivalent target network 

This measurement environment will also allow us to evaluate the performance in the network
of figure 2, where the sender (e.g. a server) is connected to an interworking unit (IP router or
Ethernet Hub) with a 10 Mbps VBR VC through an ATM backbone and the receiver (e.g. a
client) is connected to the interworking unit with a 10 Mbps Ethernet.

^2 The reader unfamiliar with TCP may find a good survey in [Ste94].



4 MEASUREMENTS WITH TCP 

Our measurements were performed in the environment shown in figure 1 with the standard
ttcp [S1M84] measurement tool. Each measurement was repeated several times, and
consisted of the memory-to-memory transfer of 4 MBytes of data in 16 KByte buffers.
We enabled the TCP_DEBUG option for all the TCP connections to gather some packet traces.
The receive and transmit window sizes were always set to 49152 bytes on both the sender and
the receiver. By using the same transmit and receive windows on both workstations, we avoid
the deadlock problems with SunOS 4.1.3_U1 discussed in [MoG95]. The round-trip time
measured by ping with small packets between the sending and the receiving workstations is
about 1800 µsec.

During our measurements, we varied the MTU size used by IP over ATM. The selected
MTU sizes correspond to the most commonly used MTU sizes for IP over ATM^3. We also
varied the Maximum Burst Size enforced by the UPC.

We chose 1000 cells as the minimum value for the MBS as we have already discussed in a
previous paper [BKD96] how TCP behaves when a UPC enforces an MBS corresponding to
only a few packets.

We chose 4000 cells as the maximum value for the MBS as our ASX-200BX (whose CAC
algorithm is based on [Gue91]) could not establish (and associate QoS guarantees to) one ATM
VC with an MBS much larger than this value.

The results of our first measurements (figure 3) were disappointing. We established one 10
Mbps VBR VC through the switch to benefit from QoS guarantees, and the throughput
achieved by TCP with the larger MTU sizes was below 1 Mbps. The throughput achieved by TCP
with an MTU size of 552 bytes was slightly higher, but still far from satisfactory. This is
obviously far too low a utilisation of the resources reserved for the 10 Mbps VBR VC.




A look at the packet traces gathered during the measurements revealed that TCP sent bursts
of packets separated by an idle time of roughly one second. The TCP statistics reported by
netstat -s showed a large number of retransmitted packets.



^3 552 bytes corresponds to the default segment size used by TCP in the wide area, 1500 bytes
corresponds to an Emulated Ethernet [LAN95], 4832 bytes to an Emulated Token Ring [LAN95], and
9188 bytes is the default MTU size for IP over ATM [Atk94].

The packet losses which caused this large number of retransmissions are due to
non-conforming cells being discarded by the UPC. This can be explained by looking at the
behaviour of our ATM adapter and at the GCRA. The large idle times, which caused most of
the throughput drops, are due to the implementation of TCP's retransmission timer in SunOS
and BSD-derived implementations and will be discussed later.

4.1 Explanation for the packet losses 

To explain the packet losses, we can consider the packet train model shown in figure 4. This 
model is close to the behaviour of our ATM adapters. When a TCP packet is sent by the SunOS 
kernel, it is encapsulated inside one IP packet and passed on to the ATM driver. The driver 
enqueues this AAL-SDU and then wakes up the firmware of the ATM adapter. The AAL5 
segmentation is performed inside the ATM adapter before the transmission of the corresponding 
ATM cells on the output link. In our environment, the transmission of one AAL-PDU by the 
ATM adapter took less time than the generation of the TCP/IP packet and the ATM driver 
processing. Thus, two consecutive AAL-PDUs were always separated by some idle time on the 
output link. 



[Figure 4: surviving label "AAL-SDU (IP packet)"]

Figure 4 Packet-train model

The ATM traffic was always conformant to the PCR, as we set the PCR of the traffic 
contract to the PCR of the 155 Mbps link. Thus, all the cells discarded by the UPC were 
discarded due to a non-conformance with the SCR and MBS parts of the VBR traffic contract. 
From the specification of the GCRA and the packet train model, it can be shown that a
sequence of $p$ AAL-PDUs separated by an inter-PDU time $I_p$, with each packet containing $c$
cells, will be compliant with the traffic contract enforced by the GCRA if, assuming that the
UPC is in the idle state upon arrival of the first cell:



$$\forall k = 0 \ldots (p-1),\ \forall j = 0 \ldots (c-1):$$

$$(k \times c + j) \times T_{SCR} \le k \times (c \times 2.726 + I_p) + j \times 2.726 + \tau_{SCR} \qquad (2)$$
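
Inequality (2) can be evaluated numerically to find how many whole AAL-PDUs of a regular packet train pass the UPC. The sketch below is only an illustration of the model: it takes the CDV tolerance of the SCR from equation (1), uses the cell periods of our environment, and the function names are hypothetical.

```python
T_PCR = 2.726   # usec per cell at the Peak Cell Rate
T_SCR = 42.4    # usec per cell at the Sustainable Cell Rate

def cell_conforms(k, j, c, inter_pdu, tau):
    """Inequality (2) for cell j of AAL-PDU k (all times in usec)."""
    arrival = k * (c * T_PCR + inter_pdu) + j * T_PCR
    return (k * c + j) * T_SCR <= arrival + tau

def max_conforming_packets(c, inter_pdu, mbs, limit=10_000):
    """Number of leading AAL-PDUs of c cells whose cells all satisfy inequality (2)."""
    tau = (mbs - 1) * (T_SCR - T_PCR)          # equation (1)
    for k in range(limit):
        if not all(cell_conforms(k, j, c, inter_pdu, tau) for j in range(c)):
            return k
    return limit

print(max_conforming_packets(c=192, inter_pdu=376.0, mbs=2000))   # 10
```

The example uses 192 cells per AAL-PDU, which is what a 9188-byte MTU yields once the AAL5 trailer is added, and an inter-PDU time of 376 µsec. Under these assumptions the model predicts that the first 10 packets of a burst are accepted when the MBS is 2000 cells.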

In an ATM LAN, TCP's behaviour is very close to this packet train model. We were not
able to measure the Inter-AAL-PDU time directly, as this would have required an ATM
analyser, which was not available where we performed our measurements. Instead, we measured
the Inter-AAL-SDU time with a modified (but older, release 2.2.9) Fore ATM driver. This
modified driver [BKD96] is able to log in a kernel table a timestamp and some related
information for each AAL-SDU sent or received by the driver. When we used TCP with this
modified driver, the timestamps showed that there were some variations in the Inter-AAL-SDU
times. These variations occurred, e.g., when TCP had to process one received
acknowledgement or fetch some data from the user process' buffer. However, 600 µsec and
900 µsec appeared to be reasonable estimates for the Inter-AAL-SDU times with the 4832 and
9188 bytes MTU sizes respectively. This corresponds to 324 and 376 µsec respectively for $I_p$,
and to raw throughputs^4 of 71 Mbps and 90 Mbps respectively.
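
The inter-PDU times and raw throughputs quoted above can be rederived from the Inter-AAL-SDU estimates. The sketch below assumes an 8-byte LLC/SNAP header in front of each IP packet in addition to the 8-byte AAL5 trailer. This encapsulation overhead is an assumption of the example, and the function names are hypothetical.

```python
T_PCR = 2.726        # usec per cell on the 155 Mbps link
AAL5_TRAILER = 8     # bytes
LLC_SNAP = 8         # bytes, assumed encapsulation header

def cells_per_pdu(mtu):
    """Number of 48-byte cell payloads needed for one AAL5-PDU (ceiling division)."""
    return (mtu + LLC_SNAP + AAL5_TRAILER + 47) // 48

def inter_pdu_time(mtu, inter_sdu):
    """Idle time (usec) left between two AAL-PDUs sent every `inter_sdu` usec."""
    return inter_sdu - cells_per_pdu(mtu) * T_PCR

def raw_throughput_mbps(mtu, inter_sdu):
    """Throughput on the link, counting all 53 bytes of every cell."""
    return cells_per_pdu(mtu) * 53 * 8 / inter_sdu

for mtu, inter_sdu in ((4832, 600.0), (9188, 900.0)):
    print(mtu, cells_per_pdu(mtu), int(inter_pdu_time(mtu, inter_sdu)),
          round(raw_throughput_mbps(mtu, inter_sdu)))
# 4832 101 324 71
# 9188 192 376 90
```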

The packet traces have shown that TCP transmitted bursts of packets separated by an idle time
of roughly one second. Figure 5 compares the maximum burst length (in packets) predicted by 
the packet train model with the average burst size found during the measurements. The model 
seems to be a good approximation of the measured values. Some of the differences between the 
model and the measurements can be explained by looking at the packet traces, but we do not 
discuss these details due to space limitations. 




Figure 5 Comparison between the measurements and the packet train model 



It should be noted that with regular traffic such as that shown in figure 4, once the
UPC has accepted a burst of maximum length sent at the PCR, it will discard cells from all the
subsequent AAL5-PDUs sent at the PCR until the link has been idle for a sufficient amount of
time. The UPC will only accept a new entire AAL-PDU of $c$ ATM cells sent at the PCR if the
link has been idle for at least $c \times (T_{SCR} - T_{PCR})$ µsec (e.g. 7.6 msec with the 9188
bytes MTU in our environment). The Inter-AAL-PDU time measured with our ATM adapters is much
smaller than this value.
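
The 7.6 msec figure can be checked with a one-line computation. The sketch assumes that an AAL-PDU carrying a 9188-byte packet occupies 192 cells, a value derived here and not given in the text. The function name is hypothetical.

```python
T_PCR, T_SCR = 2.726, 42.4   # usec per cell

def min_idle_usec(cells):
    """Idle time the link needs before a new AAL-PDU of `cells` cells sent at the PCR is accepted."""
    return cells * (T_SCR - T_PCR)

print(round(min_idle_usec(192) / 1000, 1))   # 7.6 (msec)
```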

With the smaller MTU sizes, the agreement between the model and the measurements was
not as good, especially with the 552 bytes MTU. This is probably because the workstations are
more heavily loaded with these small MTU sizes, so that the assumption of a constant
Inter-AAL-PDU time no longer holds.



4.2 Explanation for the large idle times 

The large idle times, which resulted in a low throughput, are due to the particularities of the 
implementation of TCP's retransmission timer in SunOS and BSD-derived implementations 
[WrS95]. In BSD-derived implementations, this retransmission timer is handled by the 
tcp_slowtimo routine which is called every 500 msec by the kernel [LMK89]. On each 
invocation, this routine checks for all the active TCP connections whether the retransmission 
timer has expired or not, and it invokes the tcp_send routine for each TCP connection whose 
retransmission timer has expired. 

^4 By raw throughput, we mean the total throughput sent on the output link, i.e. including the
user data, but also the ATM, AAL5, IP and TCP overheads.

During our measurements, TCP transmitted a burst of packets. Cells from the last packets of
this burst were discarded by the UPC, and TCP had to wait for the expiration of the
retransmission timer to retransmit the lost packets. As the minimum value for TCP's
retransmission timer is two 500 msec periods, the sender had to wait for approximately one
second between two bursts of packets [BKD96].
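
The effect of a coarse-grained timer can be illustrated with a simplified model. It assumes that the retransmission timer is a tick counter that is only decremented when the periodic routine runs and that fires when it reaches zero. The function name and the 5 msec offset are hypothetical.

```python
def expiry_time(armed_at, ticks, tick_msec):
    """Firing time (msec) of a timer armed at `armed_at` for `ticks` ticks.

    The counter is only decremented at multiples of `tick_msec`, when the
    periodic routine runs, and the timer fires when the counter reaches zero.
    """
    first_tick = (armed_at // tick_msec + 1) * tick_msec
    return first_tick + (ticks - 1) * tick_msec

for tick in (500, 200, 10):
    # timer armed 5 msec after a tick, with the minimum value of two ticks
    print(tick, expiry_time(5, 2, tick) - 5)
# 500 995
# 200 395
# 10 15
```

In this model a two-tick timer fires between one and two ticks after it is armed. A burst that starts right after a timer expiry re-arms the timer shortly after a tick, which is consistent with waits close to two full periods.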

A look at the packet traces revealed that sometimes the idle time jumped to two seconds 
instead of one. These larger idle times were caused by the loss of one retransmitted packet. 
When one retransmitted packet is lost, TCP's retransmission timer is doubled (exponential 
back-off), and this explains the larger idle time. The loss of one retransmitted packet was 
caused by the corruption of the corresponding AAL5-PDU by unreassembled cells from 
previously transmitted AAL5-PDUs. These losses could have been avoided with the optional 
AAL5 reassembly timeout (RAS) [I363], but this timeout was not supported by our ATM
adapters [Cyp96]. 

We would like to point out that if we had used several TCP connections multiplexed into one 
VBR VC instead of a single one, the utilisation of the VC would not have improved 
significantly. Assuming that these TCP connections start at random times, each connection will 
transmit one burst of packets and then the loss of the last packets of the burst will force it to 
wait for the expiration of its retransmission timer. Unfortunately, the retransmission timer will 
expire at the same time for several connections, and all these connections will start to retransmit 
together. A few packets will be retransmitted for each connection, before they will all be forced 
to wait for the expiration of their common retransmission timer. Obviously, such a 
synchronisation among several connections will not lead to a better utilisation of the VBR VC. 



5 POSSIBLE CHANGES TO THE TCP IMPLEMENTATION 

The TCP implementation used in SunOS 4.1.3_U1 is a production quality implementation 
which has been highly optimised over the years [MKK94]. It is representative of a wide
range of currently available BSD-derived implementations. However, it does not contain all the 
TCP modifications which have been proposed recently. We will discuss the impact of some of 
these modifications in the following sections. 

5.1 Reducing the granularity of TCP’s retransmission timer 

The 500 msec granularity for the retransmission timer is obviously too large for the ATM
environment. The current release of Solaris uses a 200 msec granularity for the retransmission
timer [Ste94], and a 50 msec granularity for the delayed acknowledgement timer. We modified
SunOS 4.1.3_U1 to use these smaller timer granularities. The measurements with these lower
timer granularities (figure 6) showed that the throughput achieved by TCP almost doubled.
With an MTU size of 552 bytes, TCP achieved a throughput of almost 6 Mbps with an MBS of
4000 cells, but only 2 Mbps with an MBS of 1000 cells. With the larger MTU sizes, the TCP
throughput was lower than 2 Mbps.

In SunOS 4.1.3_U1, any kernel timeout, including TCP's retransmission timer, depends on
the softclock [LMK89] routine, which is invoked by a clock interrupt every 10 msec. Thus,
the minimum value for the retransmission timer in SunOS 4.1.3_U1 is 10 msec^5. Figure 7
presents measurements performed with a 10 msec retransmission timer. For these
measurements, we disabled the delayed acknowledgement timer in the receiver because this
timer should ideally have a value smaller than the retransmission timer in order to avoid
inaccuracies in the round-trip-time estimation.

^5 Please note that this value is much higher than the 100 µsec value used in some simulation
studies [RoF94]. To provide such a 100 µsec retransmission timer in BSD-derived implementations,
the kernel should process one clock interrupt every 100 µsec. This would be very difficult with
today's workstations and operating systems given the high cost of switching from user space to
kernel space [Ous90]. Some kind of hardware support for the retransmission timer would probably
be necessary to achieve such a 100 µsec retransmission timer with BSD-derived implementations.




Figure 6 Measurements with a 200 msec granularity for the retransmission timer 

From figure 7, one could conclude that the granularity of the retransmission timer should be 
set to 10 msec and that all the problems will be solved. This is not entirely true. The high 
throughput achieved with this low granularity is unfortunately accompanied by a huge number 
of retransmitted packets as shown by figure 8. This figure compares the number of
retransmitted packets (reported by netstat -s) with the 500 msec, 200 msec and 10 msec
retransmission timer granularities and an MTU size of 9188 bytes.




Figure 7 Measurements with a 10 msec retransmission timer

We would like to point out that to transmit 4 MBytes of data, TCP must transmit 459 data
packets with an MTU size of 9188 bytes. Thus, with the MBS set to 1000 cells and a 10 msec
retransmission timer, 80% of the data packets are retransmitted. With the MBS set to 4000
cells, the percentage of retransmitted packets drops to 40% with the 10 msec retransmission
timer, and to 20% with the 200 and 500 msec retransmission timers. These high percentages of
retransmissions show clearly that TCP, by itself, cannot adapt its behaviour to a VBR traffic
contract in an ATM LAN.
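
The packet count quoted above can be rederived as follows. The sketch assumes 40 bytes of TCP and IP headers without options and takes 4 MBytes as 4 x 1024 x 1024 bytes. Both are assumptions of the example, and the packet counts printed for each percentage are derived here, not measured.

```python
DATA_BYTES = 4 * 1024 * 1024
MTU = 9188
TCP_IP_HEADERS = 40                 # assumed: 20-byte IP header + 20-byte TCP header

mss = MTU - TCP_IP_HEADERS          # user data carried by each packet
packets = -(-DATA_BYTES // mss)     # ceiling division
print(mss, packets)                 # 9148 459

for share in (0.8, 0.4, 0.2):
    print(f"{share:.0%} of {packets} packets is about {round(share * packets)} packets")
```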

The throughput drop which appeared with the MBS set to 2000 cells and a MTU size of 
9188 bytes is linked with a kind of race condition. The TCP packet traces revealed an almost 
periodical behaviour. TCP first sent a burst of about 15 packets. Cells from the last 5 packets of 
this burst were discarded by the UPC. At that time, TCP's retransmission timer was set to 20 
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msec (the minimum value). The first retransmitted packet was corrupted with cells from 
previous incompletely reassembled packets, and thus discarded by the ATM adapter of the 
receiver. As this retransmission failed, the exponential back-off algorithm set the retransmission 
timer to 40 msec. The packet sent after the expiration of this timer was acknowledged by the 
destination. The sender performed slow-start, and successfully retransmitted the other 4 
previously transmitted packets. The acknowledgements corresponding to these 4 packets did 
not update the value of TCP's retransmission timer as specified by Kam's algorithm [KaP89]. 
Unfortunately the next packets were discarded by the UPC, and thus did not reach the 
destination. TCP's retransmission timer expired after 80 msec. The first retransmitted packet 
was discarded by the ATM adapter of the receiver as the corresponding AAL5-PDU was 
corrupted with cells from the previously transmitted, but unreassembled, AAL5-PDUs. After 
160 msec of idle time, TCP attempted a new retransmission. This retransmission was 
successful, and TCP was able to send a burst of about 15 packets... 
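
The timer values in this trace (20, 40, 80 and 160 msec) follow from the combination of the exponential back-off and Karn's algorithm. The sketch below is a simplified model and not the code of the TCP implementation we measured: the class name, the 64 second ceiling and the way a fresh round-trip sample resets the timer are assumptions made for illustration.

```python
class RetransmissionTimer:
    """Simplified retransmission timer with exponential back-off."""

    def __init__(self, min_rto_ms=20, max_rto_ms=64000):
        self.min_rto_ms = min_rto_ms
        self.max_rto_ms = max_rto_ms  # assumption: upper bound on the timer
        self.rto_ms = min_rto_ms

    def on_timeout(self):
        """Back-off: double the timer used for the next retransmission."""
        self.rto_ms = min(self.rto_ms * 2, self.max_rto_ms)

    def on_ack(self, for_retransmitted_packet, measured_rto_ms=None):
        """Karn's rule: an ACK for a retransmitted packet is ambiguous,
        so it must not be used to update (or reset) the timer."""
        if for_retransmitted_packet or measured_rto_ms is None:
            return
        self.rto_ms = max(self.min_rto_ms, measured_rto_ms)


timer = RetransmissionTimer(min_rto_ms=20)
waits = []
for _ in range(2):             # two timeouts in a row
    waits.append(timer.rto_ms)
    timer.on_timeout()
for _ in range(5):             # ACKs of retransmitted packets: no update
    timer.on_ack(for_retransmitted_packet=True)
for _ in range(2):             # the next losses start from the backed-off value
    waits.append(timer.rto_ms)
    timer.on_timeout()
print(waits)                   # [20, 40, 80, 160]
```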

With the other MTU sizes and MBS, the retransmission timer did not reach such a high 
value. This throughput drop would have been avoided with the timestamp options specified in 
[JBZ92] as when these options are used, TCP can update its retransmission timeout, even when 
it receives acknowledgements from retransmitted packets. 




Figure 8 Number of retransmitted packets with the various retransmission timer granularities 



5.2 Other retransmission strategies 

TCP's retransmission capabilities are probably one of its weakest points. The original 
specification only supported positive cumulative acknowledgements. Based on these positive 
acknowledgements, TCP's natural retransmission strategy is go-back-n and the retransmissions 
are triggered by the expiration of the retransmission timer. This is obviously not the best 
solution in a lossy network. To overcome this limitation, many TCP implementations, including 
the one we used, support the fast retransmit algorithm. The fast retransmit algorithm [Ste94] 
assumes that a receiver will immediately send an acknowledgement when it receives an out-of- 
order data packet. When a TCP sender receives three duplicate acknowledgements pointing to 
the same data packet, it assumes that the first unacknowledged packet has been lost, and this 
packet is selectively retransmitted. This solution works well provided that the window size is 
sufficiently large [BKD96], and that no more than a single data packet is lost within a single 
window. If these conditions are not fulfilled, TCP will have to wait for the expiration of its 
retransmission timer, and this may cause large idle times. In our environment, the fast 
retransmit algorithm was not sufficient to recover from the losses caused by the UPC, even 
when we set the duplicate acknowledgements threshold to 2. This is because the idle time 
between the last data packet transmitted and the retransmitted packet was not long enough, 
and thus cells from the retransmitted packet were also discarded by the UPC. 

The other enhancements to TCP’s retransmission algorithm proposed recently like TCP 
Vegas [BOP94] or the new selective acknowledgements option [MMF96] would not have 
significantly changed the results of our measurements as all these strategies try to retransmit the 
lost packet as soon as possible. In a wide area network, however, these improved 
retransmission algorithms would probably be much more useful. 
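
The duplicate acknowledgement rule used by fast retransmit, as described at the beginning of this section, can be sketched as follows. This is a minimal model with hypothetical names; it ignores the congestion window adjustments that accompany fast retransmit in real implementations.

```python
class FastRetransmitSender:
    """Counts duplicate ACKs and reports when a packet should be resent."""

    def __init__(self, dupack_threshold=3):
        self.dupack_threshold = dupack_threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_no):
        """Return the packet number to retransmit, or None."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.dupack_threshold:
                return ack_no  # first unacknowledged packet is assumed lost
            return None
        self.last_ack = ack_no
        self.dup_count = 0
        return None


# Packet 3 is lost: every later packet triggers a duplicate ACK for 3.
sender = FastRetransmitSender(dupack_threshold=3)
decisions = [sender.on_ack(a) for a in [1, 2, 3, 3, 3, 3, 3]]
print(decisions)  # [None, None, None, None, None, 3, None]
```

The sketch also shows why the window must be large enough: at least `dupack_threshold` packets must follow the lost one before the retransmission is triggered.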

5.3 Traffic shaping 

Our measurements have shown that TCP by itself cannot adapt its behaviour to a VBR traffic 
contract. To fully benefit from the VBR traffic contract with TCP, some additional support 
would be needed from the ATM driver or adapter. For the third (C in table 1) VBR traffic 
contract, the solution could be to implement a dual-leaky bucket inside the ATM driver or 
adapter and to use this leaky bucket to shape the ATM traffic according to the VBR traffic 
contract. Our ATM adapters did not provide such leaky buckets, but when we set the cell rate of 
the ATM adapter to 10.000 kbps* (i.e. the SCR of the traffic contract), TCP achieved a good 
utilisation of the VBR VC (figure 9) with the default 500 msec retransmission timer. 
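
A dual leaky bucket shaper of the kind suggested above can be sketched with two theoretical arrival times, one for the PCR and one for the SCR. This is an illustrative model under our own assumptions, not the behaviour of any particular adapter: the class and method names are hypothetical, time is expressed in arbitrary integer units, the cell delay variation tolerance is ignored, and the burst tolerance is derived from the MBS as (MBS - 1)(T_SCR - T_PCR).

```python
class DualLeakyBucketShaper:
    """Delays cells so that they conform to both the PCR and the SCR/MBS."""

    def __init__(self, t_pcr, t_scr, mbs):
        self.t_pcr = t_pcr                        # cell period at the PCR
        self.t_scr = t_scr                        # cell period at the SCR
        self.tau_scr = (mbs - 1) * (t_scr - t_pcr)  # burst tolerance
        self.tat_pcr = 0                          # theoretical arrival times
        self.tat_scr = 0

    def earliest_send_time(self, now):
        """Earliest time at which the next cell conforms to both buckets."""
        return max(now, self.tat_pcr, self.tat_scr - self.tau_scr)

    def send(self, now):
        """Schedule one cell and return its departure time."""
        t = self.earliest_send_time(now)
        self.tat_pcr = max(t, self.tat_pcr) + self.t_pcr
        self.tat_scr = max(t, self.tat_scr) + self.t_scr
        return t


# A backlogged source: all cells are ready at time 0.
shaper = DualLeakyBucketShaper(t_pcr=1, t_scr=4, mbs=3)
departures = [shaper.send(0) for _ in range(6)]
print(departures)  # [0, 1, 2, 6, 10, 14]
```

With these toy parameters the first MBS cells leave at the PCR and the following cells are spaced at the SCR, which is the burstiness that a VBR traffic contract allows and that a single cell rate setting cannot exploit.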




Figure 9 Measurements with the PCR of the ATM adapters set to 10.000 kbps 

With the large MTU sizes, the throughput reached about 90% of the theoretical maximum if 
we take into account the overheads of the ATM cell header, AAL5 trailer and IP and TCP 
headers. There were no packet losses during the measurements with the 1500, 4832 and 9188 
bytes MTU sizes. This is in contrast with the measurements with the 10 msec retransmission 
timer (figure 7), in which the percentage of retransmitted packets was very high. With a MTU 
size of 552 bytes, the achieved throughput is below 6 Mbps. This is because, with 
this MTU size, the receiving workstation has difficulty sustaining the load and drops some 
received packets. Indeed, 6 Mbps is the highest throughput we achieved with TCP, a MTU size 
of 552 bytes, a window size of 49152 bytes and the default 500 msec retransmission timer in 
our environment, even when the UPC was disabled. 

* Of course, by setting the cell rate of the ATM adapter to the SCR of the traffic contract, we do not benefit 
from the burstiness allowed by the VBR traffic contract, but this was the only solution usable with our ATM 
adapters. In a production network with similar ATM adapters, it may be easier to use the CBR traffic contract 
instead of the VBR traffic contract when QoS guarantees are required. 

This traffic shaping could also be implemented inside the ATM driver with a rate control at 
the AAL5-PDU level. A regular flow of c cells long AAL5-PDUs (figure 4) will be conforming 
with a VBR traffic contract provided that the idle time (Ip) between successive AAL5-PDUs 
verifies [UNI30] the following equation 

$$ c = 1 + \mathrm{floor}\left[ \frac{\min\left(I_p - T_{SCR},\ \tau_{SCR}\right)}{T_{SCR} - T_{PCR}} \right] \qquad (3) $$

It should be noted that the value of Ip given by equation (3) is valid when the source sends a 
regular flow of c cells long AAL5-PDUs at the PCR during an infinite amount of time. For 
example, with a MTU size of 9188 bytes (192 cells) and the VBR traffic contract used for our 
measurements, the minimum value of Ip is 7.6 msec. If the source sends its cells with a cell 
period larger than T_PCR, the minimum value of Ip will be smaller. If the source sends bursts of 
a few AAL5-PDUs, Ip can be smaller provided that the idle time between successive bursts is 
large enough. Indeed, the VBR traffic contract allows a source to send bursts of MBS cells at 
the PCR provided that the idle time between two successive bursts is at least equal to T_SCR. If 
a rate control mechanism had to be implemented in SunOS, it could be controlled with a routine 
invoked by softclock every 10 msec [KlB96]. If necessary, it would be possible to control 
both the cell rate of the ATM adapter and the inter-AAL-SDU time. 
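
Equation (3) and the 7.6 msec example can be checked numerically. The sketch below uses exact fractions and expresses all times in microseconds. The SCR of 10 Mbps comes from the text, but the PCR of 150 Mbps and the MBS of 1000 cells used here are assumptions made for illustration (the PCR is not given in this section; 150 Mbps was chosen because it reproduces the quoted value). Cell periods are computed over 53-byte cells.

```python
from fractions import Fraction
import math

CELL_BITS = 53 * 8  # bits in one ATM cell, header included


def max_pdu_cells(ip, t_scr, t_pcr, tau_scr):
    """Equation (3): largest PDU size c (in cells) allowed by idle time ip."""
    return 1 + math.floor(min(ip - t_scr, tau_scr) / (t_scr - t_pcr))


def min_idle_time(c, t_scr, t_pcr, tau_scr):
    """Smallest ip for which equation (3) allows c-cell PDUs, or None if the
    burst tolerance is too small for such PDUs to be sent at the PCR."""
    needed = (c - 1) * (t_scr - t_pcr)
    if needed > tau_scr:
        return None
    return t_scr + needed


T_SCR = Fraction(CELL_BITS, 10)    # 10 Mbps -> 42.4 microseconds per cell
T_PCR = Fraction(CELL_BITS, 150)   # assumption: PCR of 150 Mbps
MBS = 1000                         # assumption: one of the MBS values measured
TAU_SCR = (MBS - 1) * (T_SCR - T_PCR)

ip_min = min_idle_time(192, T_SCR, T_PCR, TAU_SCR)
print(round(float(ip_min) / 1000, 1))               # 7.6 (msec)
print(max_pdu_cells(ip_min, T_SCR, T_PCR, TAU_SCR))  # 192
```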

To fully benefit from the second VBR traffic contract (tagging requested, B in table 1), it 
would probably be necessary to use some kind of feedback from the receiving ATM adapter or 
driver. With this traffic contract, the UPC tags the cells which are sent at a rate higher than the 
rate allowed by the SCR and MBS. If the network is lightly loaded, these cells will reach the 
destination. If the network is heavily loaded, they will probably be discarded by a congested 
switch before reaching the destination. Instead of letting the UPC decide which cells should be 
tagged and which cells should be left unchanged, the sending ATM adapter/driver should 
already send the cells which are known to be non-conforming with CLP=1 (the last cell of an 
AAL5-PDU should always be sent with CLP=0 as the loss of this cell will usually corrupt the 
next AAL5-PDU). With this "preventive" tagging, the sender will be able to tag entire AAL5-PDUs 
while still being able to transmit complete AAL5-PDUs with CLP=0 cells. This would help to 
avoid undesirable situations in which the UPC tags a few cells from each AAL5-PDU. The 
receiving ATM adapter/driver should also send some feedback to the sending ATM 
adapter/driver when it notices corrupted CLP=1 AAL5-PDUs, as this might indicate that the 
network is congested. A similar usage of the high and low priority cells would be necessary to 
fully benefit from the first (A) VBR traffic contract. 

With LAN Emulation and Classical IP over ATM, this could require a new standardised 
protocol within the ATM or AAL layer. It seems currently unlikely that the IETF or the ATM 
Forum will work on this subject. A new transport protocol which directly uses the AAL 
service (e.g. [AKS96]) could be optimised to adapt its behaviour as a function of the traffic 
contract of the underlying VC (CBR, VBR or best-effort), but also as a function of the 
network's congestion level (with the best-effort and second VBR traffic contracts). 



6 SHALL WE CHANGE THE UPC ? 

In [RoF94], some simulations of a heavily congested ATM LAN have shown that TCP's 
congestion control had difficulty adapting to an environment where the ATM switches discard 
ATM cells and not entire TCP packets when they are congested. This paper has also proposed 
two modifications to the ATM switches, Partial Packet Discard (PPD) and Early Packet Discard 
(EPD). With PPD, an ATM switch should drop the tail of one AAL5-PDU once it has 
dropped one cell from this AAL5-PDU due to congestion. With EPD, an ATM switch should 
drop entire AAL5-PDUs when the occupation of its buffer is above some threshold. These 
discarding strategies have been implemented in some switches, and the next release of the ATM 
Forum signalling specification will support a new Information Element that can be used by an 
end-system to request the network to treat its cell flow as frames (i.e. AAL5-PDUs). It might be 
tempting to also implement such packet discarding strategies inside the UPC. 

Footnote to equation (3): floor[x] is the largest integer smaller than or equal to x. 

PPD could easily be implemented within a UPC at the expense of one additional bit of 
memory. However, it remains to be seen whether such a modification of the UPC could 
significantly help protocols such as TCP. In any case, we believe that it is much better to invest in 
good traffic shaping capabilities within the ATM driver/adapter than in modifications to the 
UPC. 

It would be more difficult to implement EPD inside a UPC. To implement EPD (i.e. discard 
entire AAL5-PDUs when they are known to be non-conforming), the UPC should be able to 
determine whether a complete AAL5-PDU will be conforming upon reception of the first cell 
from this PDU. However, to do this, the UPC would need to know both the length of the 
AAL5-PDU as well as the rate at which it will be transmitted. Unfortunately, both these values 
are unknown when the first cell of one AAL5-PDU is received. With a VBR traffic contract, a 
UPC implementing EPD could assume that the AAL5-PDU will be sent at the PCR of the 
contract, and that the AAL5-PDU will be of maximum length (this can be determined from the 
Maximum CPCS-SDU size parameter found in the AAL Information Element of the signalling 
messages [UNI30]). However, a UPC which relies on these two assumptions would probably 
discard AAL5-PDUs (e.g. short AAL5-PDUs) that would have been declared as conforming by 
a UPC based on the GCRA. This is obviously undesirable. 
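
The PPD variant discussed earlier in this section needs only one bit of state per VC. The sketch below illustrates the idea under our own assumptions: the function name and the `is_conforming` callback (standing in for the GCRA test) are hypothetical, and the last cell of a damaged AAL5-PDU is dropped together with the rest of the tail, whereas an implementation could choose to forward it in order to preserve the PDU boundary.

```python
def upc_with_ppd(cells, is_conforming):
    """cells: iterable of (cell_id, is_last_cell_of_pdu) tuples.
    Returns the list of cell ids forwarded by the UPC."""
    discarding = False  # the single additional bit of memory
    forwarded = []
    for cell_id, is_last in cells:
        if discarding:
            drop = True  # tail of a PDU that has already lost a cell
        else:
            drop = not is_conforming(cell_id)
            if drop:
                discarding = True
        if not drop:
            forwarded.append(cell_id)
        if is_last:
            discarding = False  # the next cell starts a new AAL5-PDU
    return forwarded


# Two AAL5-PDUs of four cells each; cells 1 and 6 are non-conforming.
cells = [(i, i in (3, 7)) for i in range(8)]
print(upc_with_ppd(cells, lambda cell_id: cell_id not in (1, 6)))  # [0, 4, 5]
```

Note that the cells dropped as part of a tail are not submitted to the conformance test, so they do not consume any of the tolerance of the traffic contract.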



7 RELATED WORK 

The interactions between transport protocols and the CBR and VBR traffic contracts have been 
studied in several papers. In [BKD96], we showed that in a wide area network where a CBR 
traffic contract was enforced by a UPC, the conformance of the ATM traffic sent by the adapter 
was very important. In [KlB96], we discussed similar measurements performed with XTPX (a 
derivative of XTP) which showed that selective retransmissions are not sufficient to recover 
from the losses caused by a UPC mechanism when the ATM traffic sent is not entirely 
compliant with a CBR traffic contract. 

Other researchers used simulations to study the interactions between TCP and the VBR 
traffic contract. [Mis95] reports simulations done in the wide area and in the local area which 
show similar throughput degradations due to the large granularity of TCP's retransmission 
timer in most TCP implementations. [EAR95] reports similar simulations. They also propose to 
replace TCP's slow-start and congestion avoidance algorithms by a leaky bucket and show by 
simulations that this reduces the number of packet losses. However, this solution is not 
appropriate from an architectural point of view with both LAN Emulation and Classical IP over 
ATM. With these two solutions, the TCP entity cannot determine whether the TCP connection 
uses a VBR VC or not (TCP is not even aware of the existence of ATM). Furthermore, with 
both LAN Emulation and Classical IP over ATM, several TCP connections can be multiplexed 
within a single ATM VC. Thus, the traffic shaping must be done on a per-VC basis, within the 
ATM or AAL layer and not within the TCP layer. 



8 CONCLUSION 

In this paper, we have presented detailed measurements showing how TCP behaves when it has 
to adapt its behaviour to a VBR traffic contract enforced by a UPC. 

Our measurements have shown that, with the default settings used by most TCP 
implementations and without shaping at the sender side, TCP has great difficulty adapting to 
such a traffic contract. Indeed, with these settings, TCP only achieved a 10% utilisation of a 10 
Mbps VBR VC during our measurements. We have also shown that reducing the granularity of 
TCP's retransmission timer, even down to 10 msec, did not solve all the problems. With this 
small granularity, TCP achieved a good utilisation of the VBR VC, but this was at the expense 
of an unacceptable number of retransmitted packets (more than 80% with an MTU size of 9188 
bytes and a MBS of 1000 cells). 

In [BKD96], we had already shown that the ATM adapters should provide good ATM-level 
traffic shaping capabilities to support CBR VCs. The measurements presented in this paper 
clearly show that these traffic shaping capabilities must also cover the SCR and the MBS, 
otherwise many applications and legacy protocols will have problems using a VBR VC. With 
legacy applications or protocols above LAN Emulation or Classical IP over ATM, this traffic 
shaping should be performed automatically by the ATM driver or adapter. It should be noted 
that while our measurements were performed with TCP, they apply to any protocol which does 
not use some kind of direct or indirect (e.g. with motion-JPEG, there is one burst of cells for 
each picture and an idle time between two successive bursts) rate control. 



Abstract 

The Network Interface Framework (NIF) is an object-oriented software architecture for 
providing networking services in the Choices object-oriented operating system. The NIF 
supports multiple client subsystems, provides clients with low-latency notification of re- 
ceived packets, and imposes no particular structure on clients. By contrast, traditional 
BSD UNIX-style networking does not meet the last two requirements, since it forces clients 
to use software interrupts and queueing. BSD UNIX cannot accommodate a process-based 
protocol subsystem such as the x-Kernel, whereas the NIF can. We have ported the x- 
Kernel to Choices by embedding it into the NIF. Using the standard x-Kernel protocol 
stack with NIF yields Ethernet performance comparable to BSD networking. The NIF 
is also flexible enough to support services that cannot easily be supported by traditional 
BSD, such as quality-of-service for multimedia. Preliminary performance results for asyn- 
chronous transfer mode (ATM) networks show that the NIF can be used to minimize 
jitter for continuous media data streams in the presence of non-realtime streams. 

Keywords 

ATM, Network Interface Framework, resource exchanger, x-Kernel 



1 INTRODUCTION 

The rising popularity of multimedia in distributed systems places greater burdens on the 
networking services in an operating system than in the past. An important problem is 
how the system should structure network services to support continuous media such as 
realtime audio and video efficiently. The requirements for a flexible network architecture 
are as follows. The operating system must have a software interface to the network that 
allows multiple users of the network hardware and a great deal of individual flexibility 
for a user when processing packets. The software interface must support multiple clients, 
it must support active clients, and it must be flexible in client design and not force any 
particular processing structure on clients. Multiple client support is essential to allow 
independent users to share the network hardware simultaneously. For example, one client 
might be handling a realtime video stream while another client handles ordinary data 
traffic. The clients can differ in how they handle received buffers, and neither must be 
allowed to starve the other of buffers. This is essentially a resource management issue. 

The other two requirements specify what the software interface must permit in terms of 
clients. Support for active clients means that clients must be able to perform some limited 
processing when their packets arrive. For minimum latency, clients must have access to 
the hardware interrupt handler. Some clients will perform very limited processing that 
should be done immediately, such as accumulating packet statistics, rather than requiring 
the overhead of context switching and synchronization to schedule processes to do it. 
Flexibility in client design requires that the software interface not impose any particular 
structure on clients. For example, the interface cannot force a client to use a queue on 
which packets wait to be processed; the client might need to take immediate action, such 
as adjusting process priorities in response to the packet. Clients must be free to design 
their processing models exactly to their needs. 

BSD UNIX has influenced a great many contemporary UNIX versions on which mul- 
timedia applications such as vat and Mosaic run today. It is therefore germane to ask 
whether the traditional BSD-style networking used on these systems supports the require- 
ments. Unfortunately, BSD networking (Leffler [1989]) does not satisfy the requirement 
that clients be active. BSD network clients are passive entities that are only indirectly in- 
voked by the driver. The hardware interrupt handler takes a received packet, looks up its 
network-level protocol and places the packet on the per-protocol queue. The driver then 
posts a software interrupt, and the software interrupt handler performs receive processing 
for all queued packets. (A software interrupt is handled by the processor when there are 
no pending hardware interrupts; it is the lowest-priority interrupt in the system.) This 
arrangement does not satisfy the requirement that clients have access to the hardware 
interrupt handler, and it also imposes a particular structure, packet queueing, on clients. 

We have constructed the object-oriented Network Interface Framework (NIF) to satisfy 
all the stated requirements. Buffer transport between the driver and a network client is 
handled efficiently and with low latency; copying is avoided whenever possible. The NIF 
manages resources to prevent buffer starvation of one client by another. The NIF also lets 
clients hook into the hardware interrupt handler so that they are active entities. Finally, 
the NIF does not impose any particular client policy for received buffers. Unlike BSD, 
queueing or transfer of buffer ownership is not required. 

The NIF provides a scaffold for embedding network clients, so the next step to having 
the equivalent of BSD’s network subsystem is to embed a client providing a protocol 
stack. For a primary network client, we have chosen the x-Kernel of Hutchinson [1991] 
as a network protocol subsystem whose architecture supports arbitrary composition of 
protocols and easy construction of new ones. The x-Kernel defines an explicit structure 
for the protocol-protocol interface, which makes it possible to compose protocols in an 
arbitrary fashion. The x-Kernel uses a process-per-message approach, where each received 
packet is handled by a single process that shepherds it up through the protocol stack. 
Implementing the x-Kernel as an NIF client provides the system with common protocols 
like TCP/IP and a convenient network protocol architecture. Quality-of-service 
mechanisms for multimedia can be easily implemented using a process-based client like the 
x-Kernel. For example, in ATM each transport-level connection can be assigned a client 
and a thread pool for processing incoming messages. Our results show that even with a 
static scheduling policy this technique reduces variance in inter-frame arrival time on a 
video stream in the presence of non-realtime network traffic. 

We have implemented the NIF in our object-oriented operating system, Choices (Campbell 
[1993]). Choices is an object-oriented multiprocessor operating system written in the 
C++ programming language (Stroustrup [1986]). In Choices all system entities and 
resources (such as disks, processes, files, memory maps and so forth) are objects, with 
various frameworks specifying interactions among objects within and across different sub- 
systems (Campbell [1992]). The word framework is used in the sense that building a 
framework involves designing software components that operate together. More precisely, 
a framework is the specification of the interactions that are permitted between components 
and their relationships to each other (Deutsch [1989]). (One well-known example is 
the Model-View-Controller framework of Krasner [1988] in the Smalltalk-80 programming 
environment.) 

The remainder of this paper is organized as follows. Section 2 describes the architecture 
of the NIF. Section 3 describes how the network hardware drivers and the x-Kernel are 
implemented in Choices using the NIF. Section 4 compares the performance of the NIF 
and the x-Kernel against BSD networking on Ethernet. Section 5 illustrates how the NIF 
can be used to implement a quality-of-service mechanism for ATM, and how performance 
improves with this mechanism. Section 6 discusses the advantages of the NIF over BSD as 
they pertain to the design requirements. Finally, the Conclusion summarizes our findings. 



2 ARCHITECTURE OF THE NETWORK INTERFACE FRAMEWORK 



The Network Interface Framework (NIF) is an object-oriented framework consisting of 
three classes: NetworkDrivers, NetworkBuffers and NetworkClients. An overview of the NIF 
is given in Figure 1. NetworkBuffers are exchanged between NetworkClients and NetworkDrivers. 
The NetworkDriver acts as a multiplexor/demultiplexor in order to deliver received 
data to the proper NetworkClient or to transmit data. Each NetworkClient must maintain 
its own pool of NetworkBuffers for receiving and sending; buffers are not shared among 
clients. 

The NIF is a realization of the Resource Exchanger design pattern put forth by Sane 
[1996]. This pattern, as applied here, states that one network client should not be able 
to starve the others by consuming all of the resources (buffers). Hence the key principle 
for buffer management is “conservation of buffers.” A NetworkClient will be given a newly 
received packet only if it has already given the driver an empty NetworkBuffer for that 
packet. Otherwise the new packet is discarded. Buffers are generally exchanged between 
the driver and the client, but the NIF also supports clients that require data in particular 
buffers, using copying in these cases if necessary. 

[Figure 1 diagram. Labels: NetworkDriver, MUX, NetworkBuffers, NetworkClient, free receive buffer pool, free transmit buffer pool.] 

Figure 1 An overview of the NIF. NetworkBuffers flow between one NetworkDriver and 
many NetworkClients. 



2.1 The NetworkClient Class 

The NetworkClient base class encapsulates a subsystem that wishes to use the network 
hardware, which is represented by a NetworkDriver. Each NetworkClient interacts with a 
single NetworkDriver. 

A subclass of NetworkClient defines the receive method to perform some activity when 
a new packet intended for that client arrives. The NetworkDriver class calls this method 
at interrupt time with a NetworkBuffer containing the new packet as the argument when 
it wants to deliver data to a client. The client can process the packet immediately or can 
defer processing to a later time through an internal queue. The NIF does not impose any 
queueing or scheduling policy, so low-latency clients can act immediately if they desire. 

In order to manage buffers efficiently, the NIF requires that clients provide it with 
information about their expected source of incoming packets. Each NetworkClient must 
export the method getReceiveBufferPolicy, which returns the client’s receive buffer policy. 
This policy must be one of SwapOrReturn, MustReturn or PeekOnly. The meaning of each 
policy is given below. 

A SwapOrReturn client is indifferent to the source of the buffer that holds the data. 
The standard driver technique is a buffer Swap. The driver takes empty buffers from the 
client’s receive buffer pool in exchange for buffers from a different location that contain 
data destined for the client. An alternate technique is buffer Return, where the driver 
delivers the data in a buffer that the driver earlier obtained from the client’s own receive 
buffer pool. 

A MustReturn client must have the data delivered in one of its own NetworkBuffers. 
A client can use this policy when it has its own subclass of buffer that it must use for 
reception and so cannot swap away for generic NetworkBuffers. 

A PeekOnly client wishes only to “peek” at the data at interrupt time. The driver 
immediately delivers the data to the client in some buffer owned by the driver. The data 
in the buffer will be valid only for the duration of the call to receive, as the driver will 
then reuse the buffer by storing later packets in it after that call returns. An example of 
a PeekOnly client is a client that needs only to accumulate some summary information 
about each incoming packet. PeekOnly clients do not have their own receive buffer pools. 

Figure 2 An illustration of client receive buffer policies. A solid arrow denotes a transfer 
of buffer ownership; a dotted arrow denotes a temporary loan only. Note that the driver’s 
procurement of an empty buffer can occur before the interrupt handler invocation that 
delivers the new packet to the client. 



Figure 2 depicts these policies. In the first two cases above, the client will own the 
buffer with the data after the interrupt handler completes. The PeekOnly case is applicable 
whenever a long-term transfer of ownership is unnecessary. Since the receive method is 
called at interrupt time, there is no process scheduling discipline imposed by the NIF on 
received packets. 
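
The three receive buffer policies and the "conservation of buffers" rule can be illustrated with a small model. The sketch below is not the Choices implementation (which is written in C++): the class and method names, the 64-byte buffers and the fact that the driver obtains the client's empty buffer at delivery time are all assumptions made to keep the example short.

```python
from enum import Enum


class Policy(Enum):
    SWAP_OR_RETURN = 1
    MUST_RETURN = 2
    PEEK_ONLY = 3


class Client:
    def __init__(self, policy, free_buffers=0):
        self.policy = policy
        self.pool = [bytearray(64) for _ in range(free_buffers)]  # empty buffers
        self.owned = []     # buffers whose ownership was transferred to us
        self.received = []  # packet contents seen so far

    def get_receive_buffer_policy(self):
        return self.policy

    def receive(self, buf, length):
        # Copy the data out: a PeekOnly buffer is valid only during this call.
        self.received.append(bytes(buf[:length]))
        if self.policy is not Policy.PEEK_ONLY:
            self.owned.append(buf)


class Driver:
    def __init__(self):
        self.rx_buffer = bytearray(64)  # driver-owned buffer for the next packet

    def deliver(self, client, packet):
        """Deliver a packet (at most 64 bytes). Returns False if discarded."""
        n = len(packet)
        self.rx_buffer[:n] = packet  # stands in for the adapter filling the buffer
        policy = client.get_receive_buffer_policy()
        if policy is Policy.PEEK_ONLY:
            client.receive(self.rx_buffer, n)  # temporary loan only
            return True
        if not client.pool:
            return False  # conservation of buffers: no empty buffer, no packet
        empty = client.pool.pop()
        if policy is Policy.SWAP_OR_RETURN:
            full, self.rx_buffer = self.rx_buffer, empty  # exchange ownership
        else:  # MUST_RETURN: the data must arrive in the client's own buffer
            empty[:n] = self.rx_buffer[:n]
            full = empty
        client.receive(full, n)
        return True


driver = Driver()
swapper = Client(Policy.SWAP_OR_RETURN, free_buffers=1)
print(driver.deliver(swapper, b"abc"))  # True
print(driver.deliver(swapper, b"def"))  # False: the client has no empty buffer left
```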



2.2 The NetworkDriver Class 



The NetworkDriver manages the host adapter for the network. This class is basically a 
multiplexor/demultiplexor for NetworkBuffers that flow to and from NetworkClients. As such 
it contains a registry, and any client that wishes to use a network interface must register 
itself with the driver. When registering, the client provides a packet type that indicates 
which packets the client is interested in receiving. Subclasses of NetworkDriver are free to 
specify the exact meaning of a “type.” A special reserved value, DEFAULT_CLIENT_TYPE, 
indicates the default type, which only one client may register with any particular driver. 
The client registering this type will receive all packets for which no other client has 
registered. 
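
The registry and its default client can be sketched as a simple lookup table. The names below are hypothetical, and the sketch assumes that at most one client registers for any given packet type; the text only states this restriction for the default type.

```python
DEFAULT_CLIENT_TYPE = object()  # stand-in for the reserved value


class ClientRegistry:
    """Maps packet types to the clients registered for them."""

    def __init__(self):
        self._clients = {}

    def register(self, packet_type, client):
        if packet_type in self._clients:
            raise ValueError("a client is already registered for this type")
        self._clients[packet_type] = client

    def lookup(self, packet_type):
        """Return the client for packet_type, else the default client,
        else None (the packet is then dropped by the driver)."""
        client = self._clients.get(packet_type)
        if client is None:
            client = self._clients.get(DEFAULT_CLIENT_TYPE)
        return client


registry = ClientRegistry()
registry.register(0x0800, "ip-client")
registry.register(DEFAULT_CLIENT_TYPE, "default-client")
print(registry.lookup(0x0800))  # ip-client
print(registry.lookup(0x0806))  # default-client
```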

The driver also handles the multiplexing of outbound packets from several different 
sources. Methods are provided for synchronous and asynchronous transmission. In the 
simplest case, these methods send the packets out using a first-come, first-served order. 
More complex strategies can be used by subclasses, such as sorting buffers by some priority 
function and then transmitting higher priority requests before lower priority ones. 

2.3 The NetworkBuffer Class 

A NetworkBuffer object encapsulates a packet that is being passed between a driver and a 
client. Its use ensures a consistent abstract view of memory by all entities in the NIF. A 
NetworkBuffer represents a packet of data which may or may not be contiguous in memory. 
A buffer has a maximum length and a current length. The maximum length is fixed at 
the time the buffer is constructed, while the current length may be changed as new data 
is put into the object. Methods to copy a buffer to a contiguous memory region or to copy 
from a contiguous memory region into a buffer are also provided. It is also possible to 
iterate over the various contiguous subregions of a buffer, and to make the entire buffer 
contiguous. 

The NIF provides a concrete LinearNetworkBuffer subclass that implements a buffer 
whose data is a single contiguous region in memory. A framework entity (client or driver) 
that requires no special optimizations can simply use this subclass. A client that wants 
data to be delivered in objects of its own subclass of NetworkBuffer will need to use the 
MustReturn receive policy. The data in such subclasses of NetworkBuffer need not be 
contiguous in memory. The interface of NetworkBuffer has methods to determine whether 
a buffer object is contiguous and if so, what the address of the data is. The NIF requires 
a NetworkBuffer to be contiguous only when it is given to the driver by the client during 
received packet delivery. This is easy to accomplish in practice since the buffer must be 
empty and thus in the worst case only memory reallocation need be performed to make 
it contiguous, rather than copying. 
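
The buffer abstraction described in this section can be modelled as a list of contiguous segments with a fixed maximum length. The sketch below uses hypothetical method names and is only meant to illustrate the operations listed above (current and maximum length, iteration over subregions, copying out, and making the buffer contiguous); it does not reproduce the actual C++ interface.

```python
class NetworkBuffer:
    """A packet stored as one or more contiguous segments."""

    def __init__(self, max_length):
        self.max_length = max_length  # fixed when the buffer is constructed
        self._segments = []

    @property
    def length(self):
        """Current length: grows as data is put into the buffer."""
        return sum(len(s) for s in self._segments)

    def append(self, data):
        if self.length + len(data) > self.max_length:
            raise ValueError("data exceeds the maximum length of the buffer")
        self._segments.append(bytearray(data))

    def is_contiguous(self):
        return len(self._segments) <= 1

    def segments(self):
        """Iterate over the contiguous subregions of the buffer."""
        return iter(self._segments)

    def copy_out(self):
        """Copy the whole packet into one contiguous bytes object."""
        return b"".join(bytes(s) for s in self._segments)

    def make_contiguous(self):
        if self._segments:
            self._segments = [bytearray(self.copy_out())]


buf = NetworkBuffer(max_length=16)
buf.append(b"head")
buf.append(b"payload")
print(buf.length, buf.is_contiguous())  # 11 False
buf.make_contiguous()
print(buf.copy_out(), buf.is_contiguous())  # b'headpayload' True
```

An empty buffer is trivially contiguous in this model, which mirrors the remark above that a buffer handed to the driver for reception can be made contiguous without copying.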



3 USING THE NETWORK INTERFACE FRAMEWORK IN Choices 

This section describes how networking is implemented in Choices with the NIF and the 
x-Kernel. Since the NIF is a proper object-oriented framework, subclassing is used to 
specialize objects for particular hardware and clients. Some brief background on the Choices 
operating system is necessary first, since this section refers to two versions of Choices. 
The native version runs on the SPARCStation 2 among other machines. There is also 
an emulated or “virtual” operating system that is called Virtual Choices. This version 
runs on top of UNIX as a user process and simulates hardware interrupts with signals 
and virtual memory through mmap system calls. Virtual Choices is extremely useful for 
debugging and preliminary development. 

3.1 Drivers: Ethernet and ATM Support in Choices 

The NIF supports Ethernet on the native SPARCStation 2 version of Choices through 
subclasses of NetworkDriver. There is an EthernetDriver abstract subclass which knows 
about Ethernet packet types and Ethernet-specific control operations such as toggling 
promiscuous mode. The concrete SSlAm7990 subclass of EthernetDriver implements the 
SPARCStation 2 Ethernet driver. A NetworkClient can register for all packets with the 
same 16-bit EtherType code; the default NetworkClient will receive all packets no other 
client has explicitly registered for. The SSlAm7990 class manages an internal pool of 
receive buffers located in a special DMA region of memory; any buffer that will be used 
to receive data must be in this memory region. The SSlAm7990 always uses the Swap 
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receive policy whenever possible. Clients who are willing to swap buffers with the driver 
allocate their own buffers from the DMA region to allow this. Buffers are always pulled 
out of the receive ring and given to a cHent in return for an empty buffer if the client 
has a SwapOrReturn receive policy. If the client has a MustReturn policy, the buffer is 
copied rather than swapped, and the original buffer remains in the ring. PeekOnly clients 
merely examine the buffer during the interrupt handler; as mentioned earlier, there is no 
structured way to do this in BSD. 
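
The three receive policies described above can be sketched as follows. This is only an illustration in Python, not the Choices C++ code: the class names DriverSketch and ClientSketch, the method names, and the fallback to copying when a SwapOrReturn client has no spare buffer are all assumptions made for the example.

```python
from enum import Enum, auto


class ReceivePolicy(Enum):
    SWAP_OR_RETURN = auto()
    MUST_RETURN = auto()
    PEEK_ONLY = auto()


class ClientSketch:
    """Hypothetical stand-in for a NetworkClient."""

    def __init__(self, policy, spare=()):
        self.policy = policy
        self.spare = list(spare)   # empty DMA-region buffers offered for swapping
        self.received = []         # buffers now owned by the client
        self.peeked = []           # what a PeekOnly client saw at interrupt time

    def peek(self, view):
        self.peeked.append(bytes(view[:2]))


class DriverSketch:
    """Hypothetical stand-in for an Ethernet NetworkDriver."""

    def __init__(self, ring):
        self.ring = ring           # receive ring of bytearrays
        self.by_type = {}          # EtherType -> client
        self.default = None        # client for otherwise unclaimed packet types

    def register(self, client, ether_type=None):
        if ether_type is None:
            self.default = client
        else:
            self.by_type[ether_type] = client

    def on_interrupt(self, slot, ether_type):
        client = self.by_type.get(ether_type, self.default)
        if client is None:
            return "dropped"
        buf = self.ring[slot]
        if client.policy is ReceivePolicy.PEEK_ONLY:
            client.peek(memoryview(buf))          # examine in place, keep the buffer
            return "peeked"
        if client.policy is ReceivePolicy.SWAP_OR_RETURN and client.spare:
            self.ring[slot] = client.spare.pop()  # empty buffer goes into the ring
            client.received.append(buf)           # full buffer goes to the client
            return "swapped"
        # MustReturn, or (assumed) SwapOrReturn with no spare buffer: copy.
        client.received.append(bytearray(buf))
        return "copied"
```

In the swap case the client ends up holding the very buffer the hardware filled, which is why no copy is needed; in the copy case the ring slot is left untouched.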

The NIF supports ATM through a different scheme that exploits the early-demultiplexing 
nature of most ATM interface cards. The abstract ATMDriver subclass of NetworkDriver is 
connection-oriented, unlike the Ethernet driver, and treats each full-duplex virtual circuit 
(VC) as a “packet type.” Each ATM client, when it registers with the NetworkDriver, 
opens a full-duplex connection to the given destination and thereafter receives and sends 
all packets on the VC for that connection. 

A concrete subclass of ATMDriver implements this abstract design for a particular network
card, such as the Fore Systems 100 and 200 series ATM adapters, but the basic design
described below is the same for all subclasses. The ATM adapter delivers all packets
received on a given VC into a specified region of memory that was earlier programmed into
the hardware by the host’s operating system when the VC was opened. Therefore, in its
receive buffer pool an ATM client has a fixed number of buffers mapped into a region
accessible by the ATM hardware. The ATM driver takes possession of all this memory at
the time the client opens its connection so that the hardware can be notified of its
location, and afterwards the driver hands back to the client buffers from this pool containing
received packets. Clients generally must use a MustReturn policy. A PeekOnly client can
use an internal buffer pool in the ATM driver set aside for PeekOnly clients.
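
The buffer ownership described above (the driver takes the client's pool when the VC is opened, then hands filled buffers back one at a time) might look roughly like this. The names ATMDriverSketch, VCClientSketch, open, on_receive and give_back are hypothetical, and the DMA transfer is simulated by a slice assignment.

```python
class VCClientSketch:
    """Hypothetical ATM client owning a fixed pool of receive buffers."""

    def __init__(self, n_buffers, size=64):
        self.pool = [bytearray(size) for _ in range(n_buffers)]
        self.inbox = []                      # (vc, buffer) pairs handed back by the driver

    def receive(self, vc, buf):
        self.inbox.append((vc, buf))


class ATMDriverSketch:
    """Hypothetical connection-oriented driver: one buffer region per VC."""

    def __init__(self):
        self.next_vc = 32                    # arbitrary first VC number (assumption)
        self.owner = {}                      # vc -> client
        self.free = {}                       # vc -> buffers currently held by driver/hardware

    def open(self, client, destination):
        vc = self.next_vc
        self.next_vc += 1
        self.owner[vc] = client
        self.free[vc] = client.pool          # driver takes possession of the whole pool
        client.pool = []
        return vc

    def on_receive(self, vc, payload):
        if not self.free[vc]:
            return False                     # no buffer left for this VC: packet is lost
        buf = self.free[vc].pop()
        buf[:len(payload)] = payload         # stands in for the adapter's DMA write
        self.owner[vc].receive(vc, buf)
        return True

    def give_back(self, vc, buf):
        self.free[vc].append(buf)            # MustReturn: the client returns the buffer
```

Because each VC has its own region, a slow client can only exhaust its own buffers, never those of another connection.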

3.2 Clients: x-Kernel as a NetworkClient in Choices

The x-Kernel can be embedded into the NIF as one or more clients. The structure of the
clients is influenced by the hardware nature of the underlying driver. We first describe
the late-demultiplexing client for the Ethernet driver (the LateDemuxClient), and then
we describe the early-demultiplexing client for the ATM driver (the EarlyDemuxClient).

On Ethernet, there is no notion of a connection at the network link level; any connections
must be recovered by software (hence “late-demultiplexing”). Therefore it is convenient
to instantiate a single client for the entire x-Kernel. This client, the LateDemuxClient, is
registered for the default packet type (all packets not claimed by other clients), so that
non-x-Kernel clients can still receive any individual packet types they desire. The
LateDemuxClient manages a pool of threads which are used to shepherd received messages
upwards through the x-Kernel. The LateDemuxClient exports a SwapOrReturn receive
policy. LateDemuxClient processing at interrupt time is limited. Its receive method simply
takes the NetworkBuffer and places it on a receive queue on which the client’s pool of
x-Kernel threads is waiting. One of these threads will awaken and dequeue the packet,
create an x-Kernel Msg over it, and then shepherd it upward through the protocol stack.
No copying of packet data is necessary to create the Msg.
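
The division of labor described above (a short receive method at interrupt time, shepherd threads doing the real work) can be sketched with a queue and a small thread pool. This is an illustration only: LateDemuxClientSketch, the protocol_stack callable and the shutdown method are assumptions, and a memoryview stands in for an x-Kernel Msg built over the buffer without copying.

```python
import queue
import threading


class LateDemuxClientSketch:
    def __init__(self, n_threads, protocol_stack):
        self.rx = queue.Queue()              # receive queue the shepherd threads wait on
        self.stack = protocol_stack          # callable standing in for the protocol stack
        self.threads = [
            threading.Thread(target=self._shepherd, daemon=True)
            for _ in range(n_threads)
        ]
        for t in self.threads:
            t.start()

    def receive(self, network_buffer):
        """Called at 'interrupt time': only enqueue, no protocol processing."""
        self.rx.put(network_buffer)

    def _shepherd(self):
        while True:
            buf = self.rx.get()
            if buf is None:                  # shutdown marker
                self.rx.task_done()
                break
            try:
                msg = memoryview(buf)        # wrap the buffer, no copy of packet data
                self.stack(msg)              # shepherd it up the stack
            finally:
                self.rx.task_done()

    def shutdown(self):
        for _ in self.threads:
            self.rx.put(None)
        for t in self.threads:
            t.join()
```

A caller would construct the client with, say, three threads, call receive once per packet, and use rx.join() to wait until every queued packet has been processed.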

The EarlyDemuxClient is designed for early-demultiplexing hardware such as ATM.
Early-demultiplexing hardware is aware of link-level connections (VCs for ATM) and
performs demultiplexing at receive time by putting a packet into a memory location that
is a function of the packet’s connection identifier. An EarlyDemuxClient represents one VC
and manages a chunk of memory into which its packets are delivered. It uses a MustReturn
policy, since the ATMDriver grabs this memory when the client’s VC is opened and
hands back buffers with received packets. Each EarlyDemuxClient manages its own pool of
threads used for shepherding received packets up the x-Kernel protocol stack. This is in
contrast to the LateDemuxClient, which is not associated with an individual connection.
Interrupt time processing in the EarlyDemuxClient is the same as its late-demultiplexing
counterpart. The primary difference in system behavior arises from the fact that there
can only be a single LateDemuxClient for any packet type, whereas there can be many
EarlyDemuxClients for a given packet type.

Table 1 TCP throughput on Ethernet with 1K messages and 16K socket buffers.

platform                    receive        send
x-Kernel 3.2 in Choices     950KB/sec      920KB/sec
BSD 4.4 in Choices          1030KB/sec     980KB/sec
SunOS 4.1.3                 1080KB/sec     1050KB/sec

Table 2 TCP throughput on Ethernet with 4K messages and 16K socket buffers.

platform                    receive        send
x-Kernel 3.2 in Choices     1080KB/sec     920KB/sec
BSD 4.4 in Choices          1060KB/sec     980KB/sec
SunOS 4.1.3                 1080KB/sec     1060KB/sec

The LateDemuxClient and EarlyDemuxClient classes are generic clients for interfacing the
NIF to the x-Kernel. Therefore both classes handle certain utility functions, such as
conversion between x-Kernel’s Msg objects and the NIF’s NetworkBuffers. The x-Kernel uses
a message library providing lazily-evaluated messages implemented as trees of pointers to
data regions. The XKernelNetworkBuffer, a subclass of NetworkBuffer, is an NIF “wrapper”
for an x-Kernel Msg. XKernelNetworkBuffers are used when transmitting so that the driver
can use any hardware scatter-gather facilities, avoiding an extra copy.
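
To see why a message built as a tree of pointers suits scatter-gather transmission, consider the following sketch. It is not the x-Kernel message library: Leaf, Join, push_header and gather_list are hypothetical names, and the tree here supports only prepending headers.

```python
class Leaf:
    """A pointer to a region of an existing buffer (no data is copied)."""

    def __init__(self, data, offset=0, length=None):
        self.data = data
        self.offset = offset
        self.length = len(data) - offset if length is None else length

    def __len__(self):
        return self.length

    def segments(self):
        yield (self.data, self.offset, self.length)


class Join:
    """An interior node: the concatenation of two sub-messages."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __len__(self):
        return len(self.left) + len(self.right)

    def segments(self):
        yield from self.left.segments()
        yield from self.right.segments()


def push_header(header, msg):
    """Prepend a header by adding a node, leaving the payload where it is."""
    return Join(Leaf(header), msg)


def gather_list(msg):
    """(buffer, offset, length) triples a scatter-gather capable driver could use."""
    return list(msg.segments())


def flatten(msg):
    """Fallback for hardware without scatter-gather: one contiguous copy."""
    return b"".join(bytes(d[o:o + n]) for d, o, n in msg.segments())
```

Each protocol layer that adds a header only adds a node; the payload is touched once, by the hardware, when the gather list is transmitted.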



4 PERFORMANCE COMPARISON OF BSD AND THE x-Kernel 
ON ETHERNET 

This section describes the performance of the Choices implementation of the sockets 
interprocess communication facility (Leffler [1989] and Curry [1989]). The sockets imple- 
mentation in Choices with x-Kernel is fully described in Liao [1995]. The data show that 
the x-Kernel in the NIF performs comparably to the BSD protocol stack, implying that 
moving away from the BSD model of network processing to a process-based model does 
not impose a great penalty. 

User-level performance of the Choices sockets implementation was measured on
SPARCStation 2 workstations with 16 megabytes of RAM on an isolated Ethernet. TCP
throughput was tested with the public-domain benchmark application ttcp. We used TCP to send
or receive from a reference machine, a SPARCStation 2 running SunOS 4.1.3. The socket
buffers on the reference machine were set to 32KB to give it plenty of buffer space. SunOS
4.1.3 has a very fast networking subsystem based on BSD UNIX capable of achieving
throughput near the theoretical maximum of Ethernet, so it can be excluded from
consideration as a bottleneck in any comparison of throughput. The three operating systems
tested against this reference machine were Choices with x-Kernel, Choices with BSD 4.4
networking, and SunOS 4.1.3. The version of Choices with BSD 4.4 networking was
produced independently at an earlier date for experiments in file system caching, before we
had elucidated our requirements for the networking architecture. All of these platforms
have very similar TCP/IP protocol stacks, as the x-Kernel protocol stack is a modified
version of the BSD UNIX source, and SunOS 4 networking is also based on BSD. The
results obtained by running the benchmark in user mode on both ends are given in Tables 1
and 2.

The tables demonstrate that x-Kernel in Choices has good performance compared to 
BSD networking in Choices. Neither Choices version has throughput as high as SunOS. 
Aside from observing that SunOS is a mature commercial product that has been highly 
optimized, we can offer two explanations for this result. There are still a fair number of
spin locks and semaphores in the Choices kernel, as it is designed to be preemptable and
multiprocessor-compliant, unlike SunOS. Also, both Choices versions were built with the
freely available GNU C++ version 2.5.8, whose optimizations are not as aggressive as those
of a commercial compiler.

Comparing x-Kernel and BSD networking in Choices is useful since the two are embed- 
ded in the same underlying operating system, which factors out implementation differences 
in areas such as virtual memory. Overall, TCP throughput in x-Kernel is around 5-10% 
less than BSD. Receiving large message sizes is an apparent exception (see Table 2), since 
x-Kernel throughput is the same as the other two platforms. This exception exists only
because all three operating systems are operating close to the maximum TCP throughput
possible on our network; otherwise BSD and SunOS would probably be a little faster
than x-Kernel. The Tables show that the x-Kernel has a higher per-message overhead
than BSD.
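
The 5-10% figure can be checked directly against Tables 1 and 2. The short calculation below uses only the table values; the dictionary layout and the function name deficit_percent are choices made for this sketch.

```python
# Throughput in KB/sec as (receive, send), copied from Tables 1 and 2.
TABLE_KB_PER_SEC = {
    "1K messages": {"x-Kernel": (950, 920), "BSD": (1030, 980), "SunOS": (1080, 1050)},
    "4K messages": {"x-Kernel": (1080, 920), "BSD": (1060, 980), "SunOS": (1080, 1060)},
}


def deficit_percent(x_kernel, bsd):
    """How far x-Kernel throughput falls below BSD, as a percentage of BSD."""
    return 100.0 * (bsd - x_kernel) / bsd


for size, rows in TABLE_KB_PER_SEC.items():
    for column, direction in enumerate(("receive", "send")):
        gap = deficit_percent(rows["x-Kernel"][column], rows["BSD"][column])
        print(f"{size} {direction}: x-Kernel is {gap:.1f}% below BSD")
```

The gaps come out at roughly 7.8% (1K receive), 6.1% (1K send and 4K send) and -1.9% (4K receive, where the x-Kernel figure is slightly higher than BSD's), which matches the exception noted in the text.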

BSD networking’s minor performance advantage over the x-Kernel is actually smaller
than shown by the Tables. The x-Kernel protocol stack is not as optimized as the BSD
protocol stack. All BSD message routines are inline macros, whereas the x-Kernel data
manipulation routines are currently implemented as ordinary functions. There are also
only three layers in the BSD protocol stack (Ethernet, IP and TCP), while there are
currently five protocol layers in the x-Kernel: one extra layer between IP and Ethernet
due to the handling of address resolution using ARP, and another due to the separation
of hardware-independent and hardware-dependent aspects of Ethernet processing into
different layers. This arrangement is excellent for quick addition of Ethernet drivers for
new hardware and for creating IP protocol variants. However, for maximum performance
the protocol stack can be modified and collapsed into fewer modules to match the three
in BSD. After these optimizations x-Kernel performance should be very similar to that
of BSD.

The point of these figures is not to show that x-Kernel might be faster than BSD,
but that the NIF can support a rather different model for network protocol processing
without suffering great performance losses. After all, one can always implement
BSD-style networking using software interrupts for receive processing in the NIF by defining
the appropriate subclasses. We have chosen the x-Kernel instead for continuing work
because its performance is comparable to BSD and when combined with the NIF it is far
more flexible for the work we are pursuing in multimedia networking. The next section
gives more detail on the applicability of the NIF and the x-Kernel to multimedia and
quality of service.



5 SUPPORTING QUALITY OF SERVICE ON ATM 

Quality of service (QOS) in networks is an important issue in high-speed networking and
continuous media. The NIF can support QOS through the use of multiple network clients 
connected to a process-based protocol stack such as x-Kernel. In this section we describe 
preliminary work in this area using the NIF to provide operating system support for 
multimedia and quality of service. 

We are now using the NIF to implement quality of service mechanisms (Tan [1996]).
The platform for this work is μChoices, the microkernel-based successor to Choices. The
NIF and all clients are currently still implemented within the microkernel, so the platform
is virtually identical to the original Choices. Consequently, the actual architecture and
performance of the NIF on μChoices do not differ from the original Choices.

To illustrate why quality of service is necessary, consider Table 3. This Table gives a
baseline measurement of presentation jitter on Ethernet when an application is receiving
10 video frames per second while the host machine is also receiving some background
TCP streams (representing FTP sessions). The test platform is Virtual μChoices on a
Sun Sparcserver 600MP, using the NIF and a LateDemuxClient with x-Kernel for protocol
processing. Presentation jitter is the variance in interframe arrival time. The Table shows
that jitter increases as the number of interfering TCP streams rises. Jitter due to the
network itself was measured separately to be only 5.2% of the presentation jitter with 3
interfering TCP streams, so virtually all the jitter is due to priority inversion in protocol
processing. All incoming packets are being processed first-come, first-served, which means a
video packet must often wait for some FTP packets to be processed. This is inappropriate
since each video frame has an implicit deadline which recurs every 100 ms, while an FTP
packet has no such deadline for its processing. A video frame should not miss its deadline
simply because it arrived after some FTP packets.
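
Since presentation jitter is defined here as the variance in interframe arrival time, the three statistics reported in the jitter tables can be computed from a list of frame arrival timestamps as sketched below. The function name and the use of the population variance are assumptions; the text does not say which variance estimator was used.

```python
from statistics import pvariance


def interframe_stats(arrival_ms):
    """Return (max period, min period, variance) for frame arrival times in ms."""
    periods = [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]
    return max(periods), min(periods), pvariance(periods)


# A perfectly paced 10 frame per second stream has a 100 ms period and zero variance.
print(interframe_stats([0, 100, 200, 300]))
# One frame delayed by 10 ms stretches one period and shortens the next.
print(interframe_stats([0, 100, 210, 300]))
```

Note that a single late frame affects two consecutive periods, so delay in protocol processing shows up in both the maximum and the minimum interframe period.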

A quality of service mechanism is necessary at the operating system level in order to
reduce jitter on the video stream. The background TCP streams are not time-critical,
whereas the data received on the video stream is time-critical and becomes useless if
delayed too long before reaching the application. Therefore processing for incoming packets
for the video stream must be scheduled differently from processing for the FTP sessions.
We realize this goal by giving each transport-level connection its own VC, following the IP
over ATM model (Cole [1995]). Since VCs are not shared between higher-level connections,
it is then possible to manipulate the thread priorities for the associated EarlyDemuxClients
to provide quality of service.

Table 3 Presentation jitter statistics for a 10 frame per second
video application in the presence of interfering TCP streams on
Ethernet, using one LateDemuxClient with all threads having
the same priority.

Video Application                          No. of TCP Streams
Statistics                                 0       1       2       3
Maximum interframe period (ms)             103     114     119     130
Minimum interframe period (ms)             94      85      79      77
Interframe variance                        1       11      49      80

Table 4 Presentation jitter results for the use of end-to-end
ATM VCI application communication over TCP/IP with one
EarlyDemuxClient per VC, and static priority scheduling.

Video Application                          No. of TCP Streams
Statistics                                 0       1       2       3
Maximum interframe period (ms)             102     105     119     120
Minimum interframe period (ms)             94      91      90      89
Interframe variance                        1       5       6       7
Improvement (old variance/new variance)    1.0     2.2     8.2     14.8

Table 4 shows the same experiment repeated on ATM, with one EarlyDemuxClient per
connection. This experiment used the Fore Systems SBA-200 series of ATM adapter
boards on a Fore ASX-200 ATM network, with network links of 155 Mbit/s OC3C fiber.
A simple static priority scheme was used; the threads used by the video stream’s client
were given a higher priority than those associated with the background TCP streams. A
great reduction in presentation jitter is evident. Note that the change from Ethernet to
ATM is not significant for this experiment, since in both experiments network-level jitter
was insignificant and network bandwidth was nowhere near saturated.
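
The effect of static priorities on a video packet that arrives behind FTP packets can be illustrated with a toy single-processor simulation. Everything here is hypothetical: the simulate function, the non-preemptive service model and the processing costs are invented for the example and are not measurements from the experiments.

```python
import heapq


def simulate(packets, use_priority):
    """packets: list of (arrival_ms, stream, cost_ms). Returns {index: completion_ms}.

    Without priorities, packets are served first-come, first-served. With
    priorities, a waiting "video" packet is always served before waiting
    packets of other streams (service itself is not preempted).
    """
    order = sorted(range(len(packets)), key=lambda i: packets[i][0])
    ready = []                                   # heap of (priority, arrival, index)
    done = {}
    now = 0.0
    k = 0
    while k < len(order) or ready:
        if not ready and packets[order[k]][0] > now:
            now = packets[order[k]][0]           # idle until the next arrival
        while k < len(order) and packets[order[k]][0] <= now:
            i = order[k]
            k += 1
            arrival, stream, _cost = packets[i]
            prio = 0 if (use_priority and stream == "video") else 1
            heapq.heappush(ready, (prio, arrival, i))
        _prio, _arrival, i = heapq.heappop(ready)
        now += packets[i][2]                     # process the packet to completion
        done[i] = now
    return done


# Three FTP packets arrive together; a video packet arrives 1 ms later.
trace = [(0, "ftp", 4), (0, "ftp", 4), (0, "ftp", 4), (1, "video", 2)]
print(simulate(trace, use_priority=False)[3])    # video completion time, FIFO
print(simulate(trace, use_priority=True)[3])     # video completion time, priority
```

With these invented costs the video packet completes at 14 ms under first-come, first-served processing but at 6 ms with static priority, since it only waits for the packet already in service.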

More sophisticated scheduling algorithms incorporating deadlines will be implemented
later, but our ATM implementation on the NIF is a clear case of an architecture that
cannot be emulated with BSD-style networking. Even if the BSD software interrupt handler
were modified to handle high priority packets first, all incoming packets would still have
to be processed before exiting the handler. Receive-side protocol processing would thus
retain absolute priority over all other normal software activities in the system. The NIF
allows protocol processing to be intermingled with other system activities by allowing
the use of a process-based client like x-Kernel. This flexibility in scheduling is needed to
handle packets on different streams with differing degrees of urgency in the outbound as
well as inbound directions.

Future work will focus on refining this approach to control jitter and delay for
multimedia network traffic on ATM, using the EarlyDemuxClient-per-VC approach. Each client
enqueues received data on its respective queue, but the realtime scheduler determines
which thread(s) to run next based on the information given by the realtime clients. One
applicable realtime scheduling discipline is the Deadline/Workahead Scheduling of
Anderson [1990], where realtime processes are critical if they have unprocessed messages
that are within a deadline based on the message’s logical (not actual) arrival time. An
enhanced EarlyDemuxClient, when invoked by the driver at interrupt time, can check to
see if the newly received message would cause the shepherd thread for that packet to go
critical. If so, the client increases the priority of the process responsible for handling this
message to critical.
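
The interrupt-time check described above could take roughly the following shape. This sketch rests on assumptions that are not stated in the text: that a message's logical arrival time is its actual arrival time, pushed back if it arrives earlier than one period after the previous message's logical arrival, and that a message makes its thread critical once its logical arrival time has been reached. All names and priority values are hypothetical.

```python
class RealtimeStreamSketch:
    """Tracks logical arrival times for one stream with a nominal period."""

    def __init__(self, period_ms):
        self.period_ms = period_ms
        self.last_logical = None

    def logical_arrival(self, actual_ms):
        # Assumed rule: early (workahead) messages are treated as if they had
        # arrived one period after the previous message's logical arrival.
        if self.last_logical is None:
            logical = actual_ms
        else:
            logical = max(actual_ms, self.last_logical + self.period_ms)
        self.last_logical = logical
        return logical


def is_critical(logical_ms, now_ms):
    """Assumed rule: critical once the logical arrival time has been reached."""
    return logical_ms <= now_ms


def priority_at_interrupt(stream, now_ms, normal=1, critical=10):
    """What an enhanced client might compute when the driver invokes it."""
    logical = stream.logical_arrival(now_ms)
    return critical if is_critical(logical, now_ms) else normal
```

Under these assumptions a frame that arrives early does not raise its shepherd thread's priority, while a frame that arrives on time or late does.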



6 DISCUSSION 

As we have mentioned earlier, the NIF has two benefits over BSD-style networking: lower 
latency between reception and client processing, and less restrictive client design. This 
section elaborates further on these advantages. 

The NIF has a lower latency between the time it receives a packet and the time a
client begins processing, due to its multiple client support and policy-free structure. The
BSD hardware interrupt handler places all received packets on a per-protocol queue
corresponding to the packet type, and it then posts a software interrupt to perform the
protocol processing. Thus the lower bound on packet latency consists of the following:
determining the packet type and locating the per-protocol queue, enqueueing the packet,
posting a software interrupt, ending the hardware interrupt handler and starting software
interrupt processing. In reality, other packets may be ahead of this one on the queue so
that the real latency is unpredictable and not within the client’s ability to influence. The
NIF’s latency only includes determining the packet type and locating the client, in all 
cases. Even though it is often necessary to schedule a real process to do actual protocol 
processing, it is vital to have the opportunity to perform some activities at interrupt time. 
For example, the ATM implementation will eventually manipulate process priorities at 
interrupt time by communicating with a deadline scheduler. For this approach to work, 
the scheduler must receive notice of events such as packet arrivals as soon as possible after 
they occur. 

The NIF is much more flexible in client design than BSD since it does not require
queueing or the software interrupt model. Of course, the BSD model can be implemented in the
NIF, but with multimedia operating systems a process-based system is preferable. In BSD
all received packets are processed not by true processes, but by a single pseudo-process,
the software interrupt handler. The original motivation for using software interrupts was
to avoid an expensive process switch, since all processes were heavy-weight and each one
ran in its own address space. In modern OSes like Choices this motivation no longer holds,
since threads are supported at the kernel level and the overhead of a thread switch is not
much greater than invoking a software interrupt handler. Threads offer the advantage of
being adjustable in priority. The system designer can manipulate process priorities and
deadlines to implement quality of service guarantees for network processing. Dynamic
scheduling of this sort is just not possible with software interrupts. Even if the software
interrupt handler had its own internal priority scheduling, the protocols in BSD are
passive and so cannot be called in the hardware interrupt handler to interrogate them for
scheduling information.

A further design restriction in BSD network clients is that the software interrupt handler
cannot block, since this would have the undesirable side effect of also blocking the process
that was running when the software interrupt was taken. Since it is not possible to block
the software interrupt handler, the programmer must explicitly save all required state
when later processing is needed. Programming with processes where state is encapsulated
in the stack is more convenient and flexible for the implementor. Another important
advantage of a process-based system is that multiple processes can be concurrently executed
and synchronized with each other in a multiprocessor system. Multiple software interrupt
handlers executing concurrently cannot coordinate their activities since they must process
all queued packets in a fixed order.



7 CONCLUSION 

In summary, the NIF fully supports the requirements for flexible network services in an 
operating system, whereas BSD does not. The NIF provides an architecture for allowing 
independent network clients to access network hardware. It satisfies the requirement that
the clients’ buffer handling policies not interfere with each other through the Resource 
Exchanger paradigm. It also allows clients the opportunity to manipulate and react to 
received data at interrupt time, as the clients are active objects that share the network 
hardware. Most importantly, the NIF does not force any particular structure onto client 
software. This requirement is the major failing of BSD, which tightly binds its clients to 
its model of network services involving software interrupts and queueing. Moving away 
from this model does not impose significantly greater performance overheads; the NIF 
supports a process-based protocol stack (the x-Kernel) with performance comparable to 
BSD’s. 

The NIF is not tied to any one design for client software, and its ability to support 
process-based solutions such as the x-Kernel gives it greater flexibility than traditional 
BSD. New features of network hardware such as early-demultiplexing can be exploited by 
properly-designed NetworkClients. The NIF can support a quality of service mechanism 
for ATM through the use of one client per connection and static scheduling. Our results 
show that this mechanism reduces jitter on a realtime data stream in the presence of 
background, interfering non-realtime streams. This scheme can be generalized so that 
each client provides feedback to a QOS module that uses process priorities and deadlines 
to improve QOS even further. Our future work will involve building on top of the NIF 
and continuing to develop it to support quality-of-service and high-speed multimedia.



Abstract

In recent years, technological improvements in the Computer and Telecommunication areas have led to the
emergence of a new kind of distributed multimedia and co-operative applications, whose features and
requirements are both multiple and diversified. To meet these needs, two major architectural
approaches are currently pursued, one arguing for an improvement of the application software,
the other promoting more sophisticated network support. Following the latter approach, this
paper presents implementation results obtained around the design of a multimedia Transport
architecture based on the partial order connection (POC) concept. Design principles of a multimedia
partial order Transport connection (MM-POC) are first introduced. The main mechanisms of the associated
protocol are then detailed, and an implementation using the STREAM concept is described.
Experimental results are finally given.



1 INTRODUCTION

During the last decade, technological improvements in both the Computer and Telecommunication areas
have led to the emergence of a new kind of distributed multimedia and co-operative applications, involving the
transmission and processing of all data types (text, fixed image, voice, video sequences, etc.). The requirements
of such applications are new, in particular:

• high speed transmission is considered to be the first of these needs;

• temporal constraints have to be enforced when interactive communication is required; for instance,
real time transfer is needed to make an interactive videoconferencing system acceptable;

• exchanged data have to be orchestrated with one another, this third point raising what is
commonly called the multimedia synchronization issue;

• finally, and unlike file transfers, distributed multimedia applications may tolerate an
imperfect transfer of the data they involve without this being unacceptable from the users’ point of view.

Note that this list is not complete and may be considered as a subset of the needs that present and
future applications are going to have. A full analysis of the requirements of distributed multimedia applications
may be found in [30].

Current solutions and their limits 

The emerging generation of high speed networks now correctly addresses the need for high throughput.
Recently, many studies have been performed around the design of light weight Transport protocols
(such as NETBLT [8], VMTP [4] or XTP [20]) better suited to multimedia data transfers than
the TCP or TP4 protocols; however, these proposals do not provide a sufficient solution with regard
to the complexity of multimedia requirements. In particular, multimedia synchronization issues (both spatial
and temporal) are not addressed, and the mechanisms for managing temporal constraints are strongly
implementation dependent.

A new generation of high speed, multimedia and co-operative protocols therefore has to be designed on
top of high speed links, either starting from existing solutions (ISDN or Internet protocols) or on top
of new ones (ATM-like). Presently, several communication architectures are being envisaged within
different projects, which may be classified as follows.

The “Application Aware Networking” approach

Up to now, two major architectural approaches have been proposed to design new distributed
communication systems; both extend the Application Layer Framing (ALF) concept introduced in [9], which argues
for the preservation of Application Data Units (ADU) at the communication system level:

• the first approach assumes that the network software has to be as simple as possible, without providing
any Quality of Service (QoS) guarantee. The application is supposed to be best placed to adapt to network
fluctuations by integrating the most adequate transport mechanisms into its own software. This Network Aware
Application (NAA) approach is mainly supported by the Internet community, and is currently pursued
within the Esprit project HIPPARCH. Recent investigations and results are given in [7] [18] [17].

• conceptually different from the NAA approach, the second approach suggests increasing the complexity of the
networking system so as to handle new requirements defined at the application level. The major interest
of this approach is to simplify the user software, at the cost of more sophisticated network
support. This Application Aware Networking (AAN) approach has been or is currently pursued within
several projects: OSI 95 [11], RACE CIO [19], BERKOM [12], QoS-A [3], TENET [28] and CESAME
[15].

Paper Content

Starting from the partial order connection (POC) concept [1] [14], the work described in this paper follows
the AAN approach and reports implementation results obtained around the proposal of a
new multimedia Transport architecture, the Multimedia Partial Order Connection (MM-POC), which includes
’order’ and ’reliability’ as full QoS parameters [6] [5].

The remainder of the paper is divided into four major sections. Design principles of an MM-POC are
introduced in the first one (section 2); their suitability to tackle both synchronization issues and temporal
constraints is shown using the Timed Stream Petri Net (TSPN) model [15]. The associated multimedia
Transport protocol is detailed in section 3; protocol data unit (PDU) structures are presented, then the main
mechanisms are detailed. Implementation results based on the STREAM concept are developed in section
4; it is first shown how the STREAM concept is particularly well suited to implementing the AAN approach;
the implementation architecture is then detailed and algorithms are provided. Experimental results
are finally given and analyzed in section 5.



2 MM-POC: A MULTIMEDIA TRANSPORT ARCHITECTURE 
2.1 The Partial Order Connection concept

Currently used Transport protocols are based either on the connection-oriented (CO) paradigm or on the
connection-less (CL) one:

• on the one hand, CO protocols (TCP-like) provide their users with full reliability and total order, at
the cost of increased delay and reduced throughput;

• on the other hand, CL protocols (UDP-like) introduce no increase in transit delay or reduction in
throughput but provide neither order nor reliability guarantees.

This classification clearly reveals a design gap between CO and CL protocols, which suggests a
conceptual extension of the classical connection concept. This extension, the Partial Order Connection (POC),
of which CO and CL connections appear to be special cases, has been introduced and theoretically
investigated in [1] and [14]. The basic principles of this recent concept are recalled in the following paragraphs.

A conceptual extension of the connection concept 

A POC is an end-to-end connection that allows its users to define and use, for transferring data, any
partially ordered/partially reliable service, from no order/no reliability (UDP-like service) to total
order/total reliability (TCP-like service). In a POC, ’order’ and ’reliability’ appear as two specific QoS
parameters specified by the service user. In a POC, service data units (SDU) can be delivered to the
receiving user in an order different from the sending order: the acceptable difference between the
submission sequence and the different delivery sequences results precisely from the definition of the selected
partial order.
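
The ordering side of a POC can be illustrated by a receiver that delivers an SDU as soon as everything that must precede it has been delivered. The sketch below is not the MM-POC protocol: the class name, the representation of the partial order as a map from each SDU to its set of predecessors, and the helper functions are assumptions, and partial reliability is not modeled at all.

```python
class PartialOrderReceiverSketch:
    def __init__(self, predecessors):
        """predecessors: dict mapping each SDU id to the ids that must precede it."""
        self.pred = {sdu: set(before) for sdu, before in predecessors.items()}
        self.delivered = []        # delivery sequence seen by the receiving user
        self.waiting = set()       # SDUs received but not yet deliverable

    def on_arrival(self, sdu):
        """Accept one SDU from the network; return the SDUs delivered as a result."""
        self.waiting.add(sdu)
        newly_delivered = []
        progressed = True
        while progressed:
            progressed = False
            for candidate in sorted(self.waiting):
                if self.pred[candidate] <= set(self.delivered):
                    self.waiting.remove(candidate)
                    self.delivered.append(candidate)
                    newly_delivered.append(candidate)
                    progressed = True
                    break
        return newly_delivered


def total_order(n):
    """TCP-like special case: SDU i must follow SDU i-1."""
    return {i: ({i - 1} if i > 1 else set()) for i in range(1, n + 1)}


def no_order(n):
    """UDP-like special case: no SDU has to wait for any other."""
    return {i: set() for i in range(1, n + 1)}
```

With the partial order {1: set(), 2: set(), 3: {1, 2}} and arrivals in the order 2, 3, 1, SDU 2 is delivered at once, SDU 3 is held, and the arrival of SDU 1 releases both 1 and 3; under total_order(3) the same arrivals would hold SDU 2 until SDU 1 arrived.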

A suitable concept with regard to multimedia applications features 

Aimed at providing a formal representation of temporal constraints in distributed systems, several models
have been proposed in recent years, most of them based on the Petri net formalism [29] [21] [2].
One of these models, the Timed Stream Petri Net (TSPN) model [16] [26], has been designed to formally
describe multimedia synchronization scenarios in asynchronous distributed systems*. In this model, a
place is associated with the presentation of a data item (a still image or sound fragment, for instance); logical
dependency relationships existing either within a flow or between the different flows of the modeled
application are expressed using transitions of the net. The Petri net deduced from a TSPN model provides
a logical representation of temporal synchronization constraints, which appears to be a specific partial order.

Consider for instance the multimedia object pictured on the left part of figure 1, composed of different
monomedia objects (a logo, two fixed images and two video sequences) numbered from 1 to 99; the picture
provides the expected object disposition at the receiving side. The right part of the figure provides the
partial order, say P, deduced from the application-defined TSPN model, which illustrates (among other things)
that logo 1 and image 2 may be delivered at the receiving side independently of each other;
logical synchronization constraints between the two video sequences are expressed by the intermediate
transitions.




Figure 1 Partial order deduced from a TSPN model of a multimedia object 

Provided with a Transport connection offering the partial order P, the receiving user may then be delivered
monomedia objects (typically service data units) in any sequence consistent with both spatial and temporal
synchronization constraints, these different delivery sequences leading to faster transfer and saving resources at
both sending and receiving sides. [1] and [14] illustrate this last point through multiple examples and two
different theoretical analyses. Other work on the POC concept may be found in [10] and
[22].



2.2 MM-POC architecture

From an application point of view, it is clear that high speed Transport protocols will be used
for multimedia communication. However, as multimedia application requirements are many and varied
with regard to the kinds of data flow involved, it is now necessary to define different transfer
characteristics for each of these flows. For instance, assume an application is composed of partially
synchronized text and video flows: the Transport service has to guarantee both perfect reliability
for the text flow and sufficient throughput for the video one; however, a totally reliable Transport
connection is not needed for the video flow and high speed transfer is not a major requirement for
text-like data. Moreover, the use of a reliable service implies transmission latency, which is often
inconsistent with an acceptable high speed distributed video application.

* Other work showing the usefulness of TSPN in asynchronous operating systems may be found in [23].

Addressing this point, the most recent research has led either to an extension of the QoS concept [11]
[13] or to the proposal of new communication architectures.

As far as the latter is concerned, two kinds of architecture have been developed:



• the first one provides its users with a given set of service profiles, each of them being able to handle 
requirements of a specific [28]; 

• the second one offers a Transport interface whose parameters (throughput, transit delay, jitter) have
to be specified and then negotiated for each flow between service users and providers [3] [12]. These
works draw largely on the investigations of the QoS concept performed in [11].



Although both follow the AAN approach, neither tackles multimedia synchronization issues at the
Transport level: QoS parameters are defined for each flow, but none of them takes into account
dependency relationships between these flows. Our Transport architecture differs from previous ones on
this specific point and is based on the use of the TSPN model at different levels of the communication
system, particularly at the Transport level.



2.3 Design principles of a MM-POC 

In order to take into account the "multi-flow" aspect of a multimedia application, there is a clear
need for a multimedia Transport architecture providing a set of QoS, each of them being dedicated to
one of the flows of the application.

In our architecture, a multimedia Transport connection (called a MultiMedia Partial Order Connection
(MM-POC)) implies the establishment and then a specific co-ordination, at the Transport level, of several
monomedia connections, each of them providing a distinct QoS.

Consider the MM-POC shown in figure 2: three monomedia connections with a given QoS have been
established, each of them providing transport support for a specific data flow (for instance, a video, an
audio and a text-like data flow).




Figure 2 Multimedia Partial Order Transport Architecture 



Let us now detail the link between the POC concept and the design principles stated in the previous
paragraph; in other words, consider how the 'order' and 'reliability' parameters can be handled in the
proposed architecture.










Order management 

Partial order is managed at two different conceptual levels: (1) within each of the monomedia connections
(each then being a specific monomedia POC) and (2) between these connections. Such management takes
into account, at the Transport level, both intra- and inter-flow logical synchronization constraints, as
they are deduced from the TSPN model.

Consider for instance the multimedia partial order given on the right part of figure 1, supposed to be
deduced from a TSPN model of the object illustrated on the left part. As each color is used to indicate
a specific set of QoS requirements, the associated MM-POC will then be made of four monomedia POCs,
each of them being the Transport support of the objects (typically SDUs) of the same color.

In this example : 

• "SDU 6 has to be delivered after SDU 4" is an example of an intra-flow dependency relationship; such
constraints are managed within the corresponding monomedia POC;

• "SDU 6 has to be delivered after SDU 5" is an example of an inter-flow dependency relationship; such
constraints have to be managed at a higher conceptual level than the POC one, still within the MM-POC.



It then appears that a multimedia partial order Transport connection allows its users to select the most
suitable multimedia partial order service, with regard to the temporal synchronization constraints
expressed by the application†.

Reliability management 

In a monomedia POC, the protocol does not have to recover all PDU losses when the resulting lost SDUs do
not degrade the selected reliability. [5] shows how reliability may be managed in two
different manners, yielding in each case a transit delay improvement for an acceptable reliability decrease.
Both error control mechanisms are based on the following rule:

"delivery of a given SDU makes obsolete all SDUs that are not yet delivered (they can be lost or not)
preceding it in the multimedia partial order".

When processing an out of partial order SDU (i.e. one not deliverable with regard to the multimedia
partial order), the protocol is then allowed to deliver it if the number of SDUs made obsolete does not
cause the maximum loss level to be exceeded on any POC.

For instance, assume reliability is expressed by the maximum number of consecutive SDU losses, say
k_i, that the service user may tolerate on POC_i, for i from 1 to n (the MM-POC being here supposed to be
composed of n monomedia POCs).



Media per media reliability management 

When processing delivery of an out of partial order SDU (say A) on POC_i, the protocol will deliver A
if the PDUs made obsolete on POC_i still fulfill the reliability requirements; that is (with our reliability
definition assumption) when the number of consecutive valid SDUs (i.e. neither delivered nor lost) made
obsolete does not exceed k_i. However, it cannot deliver A if this delivery would require the loss
declaration of one or more valid SDUs on any of the other POCs; a media per media reliability is thus defined.



† Due to implementation and real management difficulties, temporal synchronization is supposed to be
managed at a higher level than the Transport level.

Per group of media reliability management

Generalizing the previous approach, the second mechanism may deliver SDU A even if this generates
a tolerable number of losses of one or more valid SDUs on any POC, that is (still with our reliability
definition assumption) when the number of valid consecutive SDUs made obsolete on POC_j does not
exceed k_j for j from 1 to n. This reliability management is said to be per group of media.

It is then clear that both mechanisms improve transit delay at the cost of an
acceptable reliability decrease, either on one POC (with "media per media" error control) or on all
POCs (with "per group of media" error control), while still enforcing both the reliability on each POC
and the selected multimedia partial order. Independently of their implementation complexity and the
processing time overhead they generate, the two mechanisms may be analyzed as follows:

• when "per group of media" error control is applied, transit delay is optimized on each POC at the
cost of a maximal but acceptable reliability decrease;

• in contrast, "media per media" error control does not fully benefit from the partial reliability concept,
but independence between POCs is preserved.
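
The two delivery rules can be illustrated by a minimal sketch. It is not the protocol implementation: it assumes the partial order is given as a map from each SDU to its direct predecessors, and it simplifies "consecutive losses" to the number of SDUs made obsolete on each POC by one delivery. The names `obsoleted`, `may_deliver`, `preds`, `poc_of` and the flow labels are hypothetical.

```python
from collections import Counter

def obsoleted(sdu, preds, done):
    """Valid SDUs (neither delivered nor declared lost) preceding `sdu`."""
    found, stack = set(), list(preds.get(sdu, ()))
    while stack:
        p = stack.pop()
        if p in done or p in found:
            continue
        found.add(p)
        stack.extend(preds.get(p, ()))
    return found

def may_deliver(sdu, preds, poc_of, k, done, policy):
    """Decide whether an out of partial order SDU may be delivered now."""
    losses = Counter(poc_of[p] for p in obsoleted(sdu, preds, done))
    if policy == "media_per_media":
        # Losses may only be declared on the POC carrying the SDU itself.
        if any(poc != poc_of[sdu] for poc in losses):
            return False
    # Both policies: stay within the tolerated loss count k on every POC touched.
    return all(count <= k[poc] for poc, count in losses.items())

# Example from the text: SDU 6 follows SDU 4 (same flow) and SDU 5 (other flow).
# The assignment of SDUs to "video" and "audio" is an assumption for illustration.
preds = {6: [4, 5]}
poc_of = {4: "video", 5: "audio", 6: "video"}
k = {"video": 1, "audio": 1}
print(may_deliver(6, preds, poc_of, k, set(), "media_per_media"))     # False
print(may_deliver(6, preds, poc_of, k, set(), "per_group_of_media"))  # True
```

With the media per media policy, SDU 6 has to wait because delivering it would declare SDU 5 lost on another POC; with the per group of media policy the loss is tolerated on both POCs and SDU 6 is delivered at once.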

Let us now consider the implementation aspects of a multimedia partial order Transport connection. 
The associated MM-POC protocol is first described (section 3). 



3 MM-POC PROTOCOL 

The MM-POC protocol, whose main PDUs and mechanisms are introduced in this section, aims to
provide Transport support for continuous media (video or audio-like). Two goals have been pursued
simultaneously in the protocol design: flow continuity preservation and retransmission delay decrease.

3.1 PDU description 

Four different kinds of PDU (see figure 3) have been defined, each of them being identified by a Type 
field: 

1. User data are carried over DATA type PDUs composed of three sub fields:

• a Payload field whose size may be variable;

• a sequence number, Number, used to identify the carried SDU;

• an Ack field, allowing the sending entity to call for an acknowledgment from the receiving side.

2. The ACK_REQ type PDU is a short packet also used to request an acknowledgment from the receiving
side. No user data can be conveyed within an ACK_REQ PDU.

3. The ACK_RES type PDU is used as a response to an acknowledgment call; it is composed of three sub
fields:

• a Last received field, the identifier of the last received SDU;

• a Losses list field, the list of lost PDUs that have to be transmitted again;

• a List size field, the size of the previous list.

4. The FNACK type PDU has the same structure as the ACK_RES PDU but is sent on the receiver's own
initiative to provide the sender with the list of PDUs it must transmit again.

Let us now look at the use of these packets within the protocol mechanisms described in the following
section.



DATA:    Type | Number | Ack | Payload

ACK_REQ: Type

ACK_RES: Type | List size | Losses list | Last received

FNACK:   Type | List size | Losses list | Last received

Figure 3 PDU structure 
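
As an illustration only, the four PDU kinds can be sketched as a small encoder and decoder. The field order follows figure 3, but the numeric type codes and the field widths (one byte for Type and Ack, two bytes for List size, four bytes for each sequence number) are assumptions, since the paper does not give them; the class and function names are hypothetical.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum

class PduType(IntEnum):  # numeric codes are assumed, not taken from the paper
    DATA = 0
    ACK_REQ = 1
    ACK_RES = 2
    FNACK = 3

@dataclass
class Data:
    number: int      # sequence number identifying the carried SDU
    ack: bool        # set when the sender calls for an acknowledgment
    payload: bytes   # variable-size user data

    def encode(self):
        return struct.pack("!BIB", PduType.DATA, self.number, self.ack) + self.payload

@dataclass
class LossReport:
    kind: PduType        # ACK_RES or FNACK, which share the same structure
    last_received: int
    losses: list

    def encode(self):
        head = struct.pack("!BH", self.kind, len(self.losses))
        body = struct.pack(f"!{len(self.losses)}I", *self.losses)
        return head + body + struct.pack("!I", self.last_received)

def decode(raw):
    kind = PduType(raw[0])
    if kind is PduType.DATA:
        _, number, ack = struct.unpack_from("!BIB", raw)
        return Data(number, bool(ack), raw[6:])
    if kind is PduType.ACK_REQ:
        return kind  # an ACK_REQ carries no field other than its type
    (size,) = struct.unpack_from("!H", raw, 1)
    losses = list(struct.unpack_from(f"!{size}I", raw, 3))
    (last,) = struct.unpack_from("!I", raw, 3 + 4 * size)
    return LossReport(kind, last, losses)

report = LossReport(PduType.FNACK, last_received=12, losses=[9, 10])
print(decode(report.encode()) == report)  # True
```

Sharing one class between ACK_RES and FNACK mirrors the statement that both PDUs have the same structure and differ only in who takes the initiative.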



3.2 Protocol mechanisms 

Order management 

Order is managed differently at the sending and the receiving sides. The sending entity is in charge of
numbering sent PDUs suitably with regard to the application-defined multimedia partial order.
On receiving a DATA PDU, the receiving entity has to verify whether the payload (typically a complete SDU)
is deliverable with respect to the partial order; undeliverable SDUs are then processed by the reliability
management mechanism.

Reliability management

Having to process an undeliverable SDU, the reliability management mechanism first checks whether the data
may be delivered at the cost of "acceptable losses" (see section 2.3). If the data (supposed to be related to
POC_i) is not deliverable with regard to the POC_i reliability QoS, then a FNACK PDU is sent by the receiving
entity, which makes the sending entity retransmit the unacceptably lost PDU(s). This mechanism is similar
to the one defined in XTP [20].

Two other goals are also pursued simultaneously: flow continuity preservation and retransmission delay
decrease. The proposed mechanism is based on the management of a sending window whose size is a function
of a Round Trip Time (RTT) measure; it can be divided into two parts.



1. In order to maintain flow continuity and to process data retransmissions as soon as possible,
the sending entity periodically requests an acknowledgment by setting the Ack bit every n sent
DATA PDUs. In order to avoid useless retransmissions, the request period must be greater than
the RTT value, which leads n to be expressed as follows: n = RTT × PF, where PF (Production
Frequency) indicates the frequency at which objects are produced by the sending user.

2. As no timer is associated with the DATA PDUs whose Ack bit is set, loss recovery of these
DATA PDUs is not ensured: as a result, the sending window may close, no free sending buffer
being available. In this case, the sending entity has to set the Ack bit of the last sent DATA
PDU and to manage an associated timer; as long as no ACK_RES PDU has been received from the
peer entity, an ACK_REQ PDU is sent each time the timer expires. The timer value has been fixed at
1.1 × RTT. For this mechanism to be efficient, the sending window has to be greater than 2 × n;
a simulation study led us to conclude that a value of 4 × n was the most suitable with regard to the
two goals pursued [24].
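
The three quantities above can be computed as in the following sketch. Rounding n up to an integer number of PDUs is an assumption made here so that the request period is not shorter than the RTT; the function name and the 100 ms RTT are hypothetical, while 25 objects per second corresponds to the 40 ms period used in section 5.

```python
import math

def ack_window_parameters(rtt_s, production_hz):
    """Derive the acknowledgment period, window size and timer from the RTT."""
    # n = RTT x PF, rounded up (assumption) and never below one PDU.
    n = max(1, math.ceil(rtt_s * production_hz))
    return {
        "ack_period": n,         # Ack bit set every n sent DATA PDUs
        "window": 4 * n,         # value retained by the simulation study
        "timer_s": 1.1 * rtt_s,  # ACK_REQ retransmission timer
    }

params = ack_window_parameters(rtt_s=0.1, production_hz=25)
print(params["ack_period"], params["window"])  # 3 12
```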



4 IMPLEMENTATION 

MM-POC implementation choices are described in this section. The implementation architecture is based 
on the Streams concept [27] whose main principles are now recalled. 



4.1 Streams concept 

The Streams concept appeared recently with the UNIX System V Release 4 standard [27]. This concept allows
its users to program directly at the kernel level by inserting or removing Streams modules. A module may
be seen as a level in an ISO layered structure of protocols [25].



Interest 

The AAN approach decreases the complexity of the application software at the cost of more sophisticated
communication software; this implies that the application must use a generic Transport service such as
our Partial Order Transport one. The Streams mechanisms permit the insertion of optional modules into
the kernel, and are particularly well adapted to the AAN approach for the two following reasons:

• First, the layered structure is respected, each level performing its own functionalities using the under- 
lying service. 

• Second, Streams modules location within the kernel makes it possible to take advantage of kernel 
priority in the scheduling. 



Streams mechanisms 

The notion of stream, or bidirectional (read and write, also called input and output) channel is the major 
point of the Streams concept. A stream enables service users to dialog with the driver that supplies the 
service. Communication is done through messages that are exchanged in both directions of the stream. 

Additional modules (optional processing units) can be inserted (pushed) and then removed (popped) at
any time on each stream. They are used to perform additional processing on messages carried on the
stream, without having to modify the underlying driver. Within a driver or a module, messages can be
either immediately processed by the interface functions that receive them (put() routines), or queued in
the queues associated with the stream (read and write queues) and then processed asynchronously by
srv() routines.



4.2 Implementation architecture

The protocol is implemented as a Streams module; figure 4 shows the protocol architecture with
two monomedia connections. It is important to note that the internal architecture is not symmetrical.
This asymmetry is caused by the Partial Order Manager (POM), which operates only
at the receiving side. We suppose here that the application sends SDUs in an order matching the selected
partial order: it is then not necessary to manage any order at the sending side.




Figure 4 Streams Architecture



At the sending side, each POC_S_i is an instance of a generic Streams module in which the PDU numbering
and PDU retransmission mechanisms are implemented.

At the receiving side, each POC_R_i is an instance of a generic Streams module implementing the POC_S_i
mechanisms, but also the order and reliability management mechanisms. This latter management, related
to the co-ordination of the different POC_R_i within the communication software, is performed using a
shared data structure holding the multimedia order and reliability parameters (grey part on figure 4).
The module was first developed over an IP driver, then over an ATM one.

4.3 Algorithms 

Two main algorithms run within the protocol, the first one at the sending side, the other one
at the receiving side.

Sender algorithm (see figure 5)

Two different kinds of events may occur at the sending side: a data submission from the sending user
(1) or the reception of an ACK_RES / FNACK type PDU from the peer entity (2).

1. Having to process user data, the sending entity performs the following steps:

• DATA PDU numbering;

• storage in the sending buffer;

• setting of the Ack bit if necessary;

• DATA PDU sending.

Figure 5 Algorithm at sending side



While the sending window is locked (when the sending buffer is full), a timer is managed at the
sending side and an ACK_REQ PDU is sent each time the timer expires; the timer is stopped when
an ACK_RES or a FNACK PDU is received.

2. Receiving an ACK_RES or a FNACK PDU, the sending entity processes data retransmissions.

Receiver algorithm (see figure 6) 




Figure 6 Algorithm of receiving side 

Two different kinds of events may occur at the receiving side: the reception of an ACK_REQ or of a
DATA type PDU from the sending entity.

1. Receiving an ACK_REQ PDU, the receiving entity sends an ACK_RES PDU indicating the sequence
number of the last received DATA PDU and the list of DATA PDUs to be retransmitted.

2. Receiving a DATA PDU, the receiving entity performs the following steps:

• 'Monomedia' order and reliability constraints are first verified:

- if the received DATA PDU does not satisfy both constraints, i.e. when it is not deliverable
with regard to the monomedia partial order and could be delivered only at the cost of unacceptable
losses on the POC it is related to, then a FNACK PDU is sent and the DATA is stored in
the receiving buffers;

- else, 'multimedia' order and then reliability constraints are checked:

* if the DATA cannot be delivered with regard to the multimedia partial order, the receiving
entity tries to deliver it at the cost of acceptable loss(es): in this case, the DATA is
delivered and notification of the loss(es) is provided to the receiving user through an indication
service primitive; undeliverable DATA are stored in the receiving buffers;

* else, the DATA is delivered.

• After each DATA delivery, the receiving entity has to check its buffers: stored DATA that can
now be delivered with regard to both monomedia and multimedia constraints are delivered to the
receiving user;

• If the Ack bit of the received DATA is set, an ACK_RES PDU is sent by the receiving entity
only if a FNACK PDU has not been sent during the first step of the process.
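
Step 1 of the receiver algorithm, the construction of the ACK_RES content, can be sketched as follows. The sketch assumes that sequence numbers on a POC increase from a known first value, that "last received" means the highest number seen so far, and that SDUs already declared as acceptable losses are not requested again; the function and parameter names are hypothetical.

```python
def build_ack_res(received, declared_lost=frozenset(), first=1):
    """Return (last_received, losses) to be carried by an ACK_RES PDU."""
    if not received:
        return None, []
    last = max(received)
    # Every gap below the highest received number is a PDU to retransmit,
    # unless it has already been given up as an acceptable loss.
    losses = [n for n in range(first, last)
              if n not in received and n not in declared_lost]
    return last, losses

print(build_ack_res({1, 2, 5, 6}))                     # (6, [3, 4])
print(build_ack_res({1, 2, 5, 6}, declared_lost={3}))  # (6, [4])
```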



5 EXPERIMENTATION AND MEASURES 

The following study aims at evaluating our MM-POC protocol implementation in a critical network
environment that generates both losses and inter-flow desynchronization. In order to enforce multimedia
order / reliability QoS parameters, several mechanisms have been designed and two different approaches
have been theoretically investigated in section 5.1 and 5.2: 'media per media' and 'per group of media';
their impact on the transit delay variation is now evaluated and compared, taking into account the three
following parameters: order, reliability and the loss rate generated by the network.

The experimental results provided here have been obtained over a multimedia connection providing
transport support for an application involving real time distribution of two different continuous media,
such as audio and video. Measurements have been made on our IP/Ethernet LAN with SUN SPARCstation 10
workstations. As the loss level observed on the LAN is very low, artificial losses have been
introduced using a dedicated Streams module pushed between the driver and each POC. In this module,
the 'loss' parameter is expressed as the ratio of discarded PDUs to all sent PDUs. In the following
measurements, this rate is fixed at 10% in order to make clear the interest of our architecture in a
critical environment.

The partial order given in figure 7 is supposed to express the logical synchronization constraints deduced
from a TSPN modeling a videoconferencing system.

Objects of connection 1 are sent periodically every 40 ms; objects of connection 2 are sent
periodically every 80 ms.

Figure 7 Partial Order



5.1 Media per media management 

Experimental results given in figure 8 may be interpreted as follows:

• while the selected reliability (i.e. the acceptable loss number on each POC) is lower than the network
loss rate, retransmissions are performed by the protocol, which implies an increase in the average
transit delay variation (ATDV);

• when reliability becomes greater than the network loss rate, the number of retransmissions decreases
drastically and the ATDV (around 4 ms on figure 8) may be considered as resulting from the network
jitter and the asynchronous nature of the CSMA/CD access control protocol.



5.2 Per group of media management 

Compared with the results given in figure 8, the experimental measurements shown in figure 9 reveal an
analogous overall behavior of the two curves.

Note that when the reliability is greater than 10 %, the ATDV does not present the dispersion observed
with 'media per media' management; this is in agreement with the theoretical comparison of the two
mechanisms, which argues for a smaller number of retransmissions (at the cost of acceptable loss
declarations) when a 'per group of media' mechanism is applied.




Figure 8 Media per media management

Figure 9 Per group of media management

These results also show that a 'per group of media' implementation is no more expensive than
a 'media per media' one as far as processing time is concerned.

In order to highlight the interest of 'per group of media' management, a second behavior has to be
considered: inter-flow desynchronization. Results and analysis are provided in the next section.

5.3 Management with desynchronization 



Per group of media management    Media per media management





Figure 10 Management with desynchronization 

As inter-flow desynchronization cannot be significantly observed on our LAN, an artificial 20 ms delay has
been introduced in the sending of each object related to connection 2. Experimental results given in
figure 10 clearly show a stronger ATDV decrease when 'per group of media' management is applied.

This phenomenon is easily explained: when a 'per group of media' policy is applied, the protocol
is allowed to deliver data related to flow 1 at the cost of acceptable loss(es) on the late data related to
flow 2, which significantly decreases the ATDV of both flow 1 and flow 2. When a 'media per media'
policy is applied, such loss declarations are not allowed and SDUs related to flow 1 have to wait for flow
2 SDU delivery before being delivered.



6 CONCLUSION 

This paper first recalled the basic principles of a newly introduced multimedia Transport architecture
based on the partially ordered / partially reliable connection (POC) concept. This architecture follows
the Application Aware Networking (AAN) approach to the design of distributed multimedia applications,
which argues in favor of more sophisticated communication software ensuring a user-defined QoS. The
major part of the paper consists of implementation results for a multimedia partial order
Transport protocol whose main mechanisms are introduced here. In particular, a comparative study of
two different multimedia order / reliability management approaches is provided.

A multimedia partial order Transport connection (MM-POC) implies the establishment and then a specific
co-ordination, at the Transport level, of different 'monomedia' partial order Transport connections (POCs),
each of them providing a suitable QoS with regard to the features of the conveyed flow. The co-ordination
of monomedia POCs, which consists of partial order / partial reliability management between POCs, has been
envisaged in two different ways: connection per connection (i.e. 'media per media') and per group of
connections (i.e. 'per group of media'), both approaches ensuring a QoS guarantee as far as order and
reliability parameters are concerned; a theoretical comparative analysis has been given, arguing that
transit delay is improved when a 'per group of media' policy is applied.

The main MM-POC protocol mechanisms and PDUs have been described in the second part of the paper.
Designed to provide Transport support for continuous media, the MM-POC protocol aims
to ensure flow continuity preservation, fast retransmissions and bandwidth saving, while still enforcing
both multimedia order and reliability QoS parameters.

In the third part of the paper, the MM-POC implementation architecture has been detailed. This
architecture is based on the 'Streams' concept, which provides a framework particularly well suited to
the AAN approach. The integration of both order and reliability management led to the presentation of
two efficient algorithms.

Finally, an experimental study has been presented, involving typical videoconferencing system traffic
on a network generating losses and inter-flow desynchronization. It appeared that a 'per group
of media' reliability management was more efficient than a 'media per media' management with respect
to the average transit delay variation.

This experimental work should be pursued along two axes: comparison with the Network Aware Application
approach and investigation on a WAN using IP networking or native ATM networking with actual losses
and desynchronization.



Abstract

Packet level forward error correction can be implemented by transmitting M redundancy
packets after each set of N regular packets, so that all packets can be reconstructed
if at least N out of N + M are received. The idea is usually dismissed
because the gain is not worth the additional transmission overhead and the increased
computation load, but further analysis shows that this dismissal may be questioned.

There are in fact at least three environments where the reduced error rate proves
very valuable. When multicasting data toward large groups, even a small individual
error rate per recipient may result in large retransmission rates for the whole group,
and the use of redundancy will result in dramatic efficiency gains. In the case of long
transmission delays, the use of redundancy helps maintain the delivery delays within
acceptable limits, even in the presence of errors. When the receivers do not have enough
memory resources to implement sophisticated retransmission techniques, forward error
correction can compensate for the relative inefficiency of cheap algorithms of the
go-back-N family.

Packet level forward error correction is not very difficult to implement. The level
of redundancy can easily be tuned as a function of the network's characteristics. The
additional robustness obtained through packet level redundancy helps in implementing
feedback control algorithms. In short, this seldom used technology could easily improve
the performance of current transmission control protocols.



1 Zero defect and networking

Although packet level FEC has been proposed by many researchers, the idea is usually
dismissed because the gain is not worth the pain. Implementing complex packet redundancy
code in software is deemed difficult. It is costlier than a checksum, at least as expensive as
a memory move, and thus likely to nearly double the computational load of transport control
protocols such as TCP. Then, in the case of point to point transmission, one can argue that
a selective acknowledgement procedure is always more efficient, transmission wise, because
selective retransmission inserts exactly the level of redundancy that is actually needed.

But modern applications often involve more than two parties. In their communication [1]
to the 1995 Sigcomm conference, Sally Floyd and her coauthors present their reliable multicast
framework and explain why “repair request and retransmissions are always multicast to
the whole group.” A consequence of this architectural decision is that the overhead of
retransmissions will grow with the size of the group. In section 3 we will detail this effect
and show how it can be alleviated by packet level FEC.

The reduction in the packet level error rate may in fact be beneficial for point to point
exchanges, when the transport protocol does not exhibit near-perfect characteristics.
This is the case, for example, of simple TCP implementations, which rely on time-outs to
cope with certain forms of errors. We will discuss this aspect in section 4.

Even when perfect retransmission procedures are available, the correction of errors results
in long transmission delays, especially when the application requires that data be re-ordered
before processing. Section 5 explains how packet level FEC results in better delays and is
thus suitable for delay sensitive applications.

Before exposing the advantages of packet level FEC, we will first examine how it can be 
implemented and used in practice. We will later re-examine the implementation problem 
after exposing the main advantages of the solution, showing how the level of redundancy can 
be adapted to the network configuration and status. 

2 Packet level forward error correction 

Bit level error correction operates on sequences of bits. It comes in two variations, block
codes and convolutional codes. In block codes, a set of N input bits is completed by M
redundancy bits. Out of the 2^(N+M) codes which can be encoded with N + M bits, only 2^N
are allowed. When the block is received, the redundancy is used to find the allowed code
which best matches the received pattern, thus removing errors if at least N bits out of N + M
were received correctly. In convolutional codes, each transmitted signal is a function of the
input signal and of the history. Convolutional codes may be more robust than block codes,
because they can use the redundancy which was spread over several consecutive blocks. They
are also much more difficult to implement, at least in software.

In this paper, we will not consider the variants of packet level redundancy that can be
built with convolutional codes. For practical reasons, we will only consider block codes where
the transmission of N input packets is complemented by that of M redundancy packets. If
at least N packets out of N + M are received correctly, then all N input packets can be
retrieved. If fewer than N packets out of N + M are received, we cannot gain advantage from
the redundancy but we can at least retrieve the fraction of the initial N packets which made
their way to the receiver.

Packet level correction is different from bit level correction, because it deals with straight
packet losses, not with unpredictable bit values. It can thus be implemented very simply
with a combination of exclusive or (⊕) and shift (≪) operations. If we note X_1..X_N the input
packets and Y_1..Y_M the redundancy packets, then we can obtain Y_j as a function of the X_i:

= © {Xi < - 1 )) 

The resources necessary for implementing this form of redundancy are memory and computing
power. The sender uses M additional buffers to compute the redundancy; the receiver
must have N + M buffers available to use it in case of errors. The cost of computing each Y_j
is approximately equivalent to one memory move per input packet. Senders and receivers
will probably have the required memory readily available, but the computing cost has often been
deemed excessive. A redundancy of form N + 1 may double the processing requirement of
the best TCP implementations. We will thus only consider the redundancies of forms N + 1,
N + 2 and N + 3.
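
As an illustration of the simplest case, the sketch below builds a single redundancy packet as the byte-wise exclusive or of the N input packets and uses it to rebuild one lost packet. It assumes equal-length packets and applies no shift, which is an assumption made here for the N + 1 form only; the function names are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise exclusive or of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(packets):
    """Single redundancy packet for the N + 1 form: XOR of all input packets."""
    return reduce(xor_bytes, packets)

def recover(received, parity_packet):
    """Rebuild the block when exactly one input packet (None) is missing."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("N + 1 redundancy repairs exactly one loss")
    present = [p for p in received if p is not None]
    # XORing the parity with every received packet leaves the missing one.
    repaired = reduce(xor_bytes, present, parity_packet)
    return [repaired if p is None else p for p in received]

block = [b"abcd", b"efgh", b"ijkl"]
y1 = parity(block)
print(recover([b"abcd", None, b"ijkl"], y1) == block)  # True
```

The receiver needs to hold the whole block before it can repair a loss, which is the N + M buffer requirement mentioned above.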

For simplicity's sake, we will assume that errors are independent. With the N + 1, N + 2
and N + 3 redundancies, the perceived packet loss rate is a function of the initial loss rate e
and the number N. The resulting rates e_1, e_2 and e_3 can be computed as:

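$$e_1 = e \left(1 - (1 - e)^{N}\right)$$

$$e_2 = e \left(1 - (1 - e)^{N+1} - (N + 1)\,e\,(1 - e)^{N}\right)$$

$$e_3 = e \left(1 - (1 - e)^{N+2} - (N + 2)\,e\,(1 - e)^{N+1} - \frac{(N + 2)(N + 1)}{2}\,e^{2}(1 - e)^{N}\right)$$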
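
The three expressions follow one pattern, which can be checked numerically. The sketch below is one reading of that pattern, assuming independent losses: an input packet stays lost when it is lost itself and at least M of the other N + M - 1 packets of its block are lost too. The function name `residual_loss` is an assumption made for illustration.

```python
from math import comb


def residual_loss(e: float, n: int, m: int) -> float:
    """Loss rate seen above an N + M block code, for independent losses of rate e."""
    others = n + m - 1
    # Probability that fewer than m of the other packets of the block are lost.
    fewer_than_m = sum(
        comb(others, k) * e**k * (1 - e) ** (others - k) for k in range(m)
    )
    return e * (1 - fewer_than_m)


for n, m in [(7, 1), (6, 2), (5, 3)]:
    print(f"{n} + {m}: {residual_loss(0.01, n, m):.2e}")
```

For e = 1% this gives roughly 6.8 × 10^-4, 2 × 10^-5 and 3.4 × 10^-7 for the 7 + 1, 6 + 2 and 5 + 3 schemes.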



In each case, the amount of additional overhead is M/(N + M). The question that we set
out to answer is whether the gain of a reduced error rate is worth the pain of this additional
overhead, as well as the cost of implementing the redundancy.



3 FEC and large multicast groups 

Transmission efficiency is often defined as the “goodput” ratio, i.e. the number of packets that
will be received and processed divided by the total number of packets that will be transmitted.
According to this ratio, for point to point transmission, packet level redundancy is
always less efficient than selective repeat procedures. In one case, we pay an insurance
and always send some redundancy in the hope that this redundancy may correct a transmission
error. In the other case, we only send the redundancy after an error has occurred, thus
effectively minimizing the overhead. The situation changes however if we consider multicast
transmission.

To demonstrate this effect, let us consider a very simple model. A source S transmits a
stream of packets, for example a data file, towards a group of G receivers, using a multicasting
service. We will assume that all transmission errors occur in the last leg of transmission,
that the errors experienced by the receivers are independent and that each receiver has a
probability e of not receiving any given packet. This hypothesis is not entirely realistic
because errors will also occur in the common paths between S and subsets of receivers and
because receivers will experience different networking situations. We adopt it because it
results in very simple computations, and also because the simplification does not render our
results invalid. Multicast packets may be affected by errors at different stages of transmission,
but it is very clear that the more receivers we add to the group, the larger the probability
that some of them will lose any given packet.

Figure 1: Number of transmissions, as a function of the number of members in the group,
the packet error rate being 1%

Figure 1 shows the average number of times a packet has to be transmitted in order to
be received by a group of size G, as a function of the size of the group. It takes into account
both the initial transmission due to the redundancy scheme and the other retransmissions
caused by the selective repeat procedures. As explained in [1] we assume that all retransmissions
are sent to the whole group. In figure 1 we assume that the packet loss rate is
e = 1%, a rate which is not uncommon in the Internet today. The figure contains four curves,
corresponding to the raw case, without redundancy, and to three variations of redundancy,
namely 7 + 1, 6 + 2 and 5 + 3. These three levels are chosen arbitrarily, the rationale being that
they only require the buffering of 8 packets, and are thus reasonably easy to implement.
When redundancy is used, e is replaced by the lower value e_1, e_2 or e_3, i.e. 6.8 × 10^-4,
2 × 10^-5 or 3.4 × 10^-7 instead of 10^-2. For low recipient numbers, the gain of redundancy is not worth
the increased overhead of even a 7 + 1 scheme. But the number of transmissions increases
with the number of recipients. As soon as this number reaches 18, the no redundancy curve
passes above the 7 + 1 limit, which will itself be passed by the 6 + 2 curve when the number
of recipients is larger than 300. The physical explanation is very simple. A packet will be
retransmitted once if it is lost by at least one receiver. The probability of this event is:

$$1 - (1 - e)^{G}$$

Figure 2: Number of transmissions, as a function of the number of members in the group,
the packet error rate being 0.1%

Figure 3: Number of transmissions, as a function of the number of members in the group,
the packet error rate being 10%

Figure 4: Efficiency as a function of the error rate, for a group of 100 receivers

There may well be multiple retransmissions for a given packet, for example if a sufficient
number of recipients missed the initial transmission: the common assumption that a single
retransmission always suffices is too simplistic. The probability of this event increases with
the size of the group. In fact, the probability that one given receiver requires exactly X
transmissions of a given packet is:

$$(1 - e)\, e^{X-1}$$

Alternatively, we can write that the probability of having fewer than X transmissions is:

$$1 - e^{X-1}$$

In the case of multiple recipients, a packet will have to be retransmitted as long as it is not
received correctly by all recipients. The probability of having fewer than X transmissions,
for a group of size G, is:

$$\left(1 - e^{X-1}\right)^{G}$$

We used this formula to compute the curves of figure 1, as well as those of figure 2, for a
loss rate of 10^-3, and those of figure 3, for a loss rate of 10^-1. The comparison of these three
figures shows that there is not one best level of redundancy. We have to balance the gain
of reduced errors and the pain of increased overhead, and the trade-off is a function of the
loss rate and the size of the group. This is further demonstrated by the curves of figure 4,
which show the efficiency of transmission as a function of the error rate, for a group of 100
receivers. We can distinguish four successive ranges, where each of the possible solutions
should be chosen. We may also observe the limits of packet level redundancy. If the packet
loss rate becomes very large, say higher than 10%, even a redundancy of rate N + 3 cannot
result in satisfactory performance. This is also visible in figure 3.
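
The average number of transmissions can be sketched from the group formula. The code below is a minimal reading of the model, not necessarily the exact accounting behind figures 1 to 3: it sums the probability that a packet must still be sent after k transmissions, and multiplies by the block overhead (N + M)/N. Both function names are assumptions, and the residual loss rate is passed in by the caller.

```python
def expected_transmissions(e: float, group: int, tol: float = 1e-12) -> float:
    """Mean number of times a packet is sent until all receivers hold it."""
    if not 0.0 <= e < 1.0 or group < 1:
        raise ValueError("need 0 <= e < 1 and at least one receiver")
    total, k = 0.0, 0
    while True:
        # Probability that at least one receiver has missed all k copies so far.
        still_needed = 1.0 - (1.0 - e**k) ** group
        if still_needed < tol:
            return total
        total += still_needed
        k += 1


def packets_per_input_packet(e_residual: float, group: int, n: int, m: int) -> float:
    """Packets sent per input packet: block overhead times mean repeat count."""
    return (n + m) / n * expected_transmissions(e_residual, group)


raw = packets_per_input_packet(0.01, 100, 7, 0)        # no redundancy
with_7_1 = packets_per_input_packet(6.8e-4, 100, 7, 1)  # residual rate from the text
```

With these assumptions a 7 + 1 scheme costs more than plain retransmission for a single receiver, and less for a group of 100 receivers at a 1% loss rate.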

Choosing the best level of redundancy depends however on the sender’s objectives, as
reducing the packet loss rate has two effects besides allowing a better goodput for file multicasting.
It also reduces delay, and may be used to simplify the retransmission procedures.

Figure 5: Efficiency of classic TCP as a function of the error rate, for a window size of 200
packets

4 FEC and simple transport protocols 

In the previous section, we assumed that the transport protocol implemented fully selective 
retransmissions. In practice, implementations often fall short of this ideal. In classic imple- 
mentations of TCP, once an error has been detected by a timer, a retransmission is triggered. 
The connection will remain inactive as long as this retransmission has not been acknowledged.
During this wait, we could have transmitted a full window of
W packets. This waiting time will occur after every error, thus at the end of every run of
successful transmissions. As the length of such runs is 1/e, we may define the efficiency of 
the procedure as the ratio of the length of the run over the number of packets that could 
have been transmitted in the absence of error, i.e., the length of the run plus the size of the 
transmission window: 

$$\frac{1/e}{W + 1/e} = \frac{1}{eW + 1}$$

Admittedly, this formula does not take into account the details of congestion control algorithms.
It is in some sense an upper bound to the efficiency of classic TCP. We used it to compute
the curves of figure 5, where we assumed a window size of 200 packets. These curves show
clearly the usefulness of redundancy when the error rate is larger than 10^-3. The threshold
naturally varies with the size of the window. An approximation of this threshold can be
obtained by observing that N + M redundancy is never justified if the goodput of straight
transmission is better than the best that can be achieved with redundancy, that is if:

$$\frac{1}{eW + 1} > \frac{N}{N + M}$$

This inequality can be simplified to:

$$eW < \frac{M}{N}$$



By observing figure 5 we see that the converse is also true. 7 + 1 redundancy is already almost
justified when the efficiency of straight transmissions falls under 87.5%, 6 + 2 redundancy is 
almost immediately justified when the efficiency of 7 + 1 falls under 75%, 5 + 3 redundancy 
is almost immediately justified when the efficiency of 6 + 2 falls under 62.5%. We will use 
this observation in section 6. 
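
The bound and the threshold test of this section are easy to evaluate. The following sketch only restates the two formulas; the function names are assumptions made for illustration, and the bound ignores congestion control as noted above.

```python
def classic_tcp_efficiency(e: float, window: int) -> float:
    """Upper bound 1 / (e*W + 1): a run of 1/e packets, then a stall of W packets."""
    return 1.0 / (e * window + 1.0)


def redundancy_may_pay(e: float, window: int, n: int, m: int) -> bool:
    """True when straight transmission falls below the N / (N + M) ceiling.

    This is equivalent to e * W > M / N.
    """
    return classic_tcp_efficiency(e, window) < n / (n + m)


# Window of 200 packets, as assumed for figure 5.
for e in (0.0005, 0.001, 0.01):
    print(e, round(classic_tcp_efficiency(e, 200), 3), redundancy_may_pay(e, 200, 7, 1))
```

For W = 200 the 7 + 1 test switches when e exceeds 1/(7 × 200), i.e. about 7 × 10^-4.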

Of course, we know how to implement better procedures than classic TCP, using for example
fast retransmission in case of duplicate acknowledgement. Matt Mathis, Jamshid
Mahdavi, Sally Floyd and Allyn Romanow recently proposed a selective acknowledgement
extension to TCP that would in theory result in an almost perfect utilization of the transmission
resources [2]. However, their proposal establishes a balance between transmission
efficiency and implementation requirements. As selective acknowledgement can only be advisory,
Mathis et al. have to rely on timers to solve errors which would not be cured by a
first retransmission. This will occur when both the initial packet and the retransmission are
lost, with a probability of e^2. The waiting time of W packets will thus occur at the end of
each run of successful transmissions or retransmissions. The length of these runs will be 1/e^2,
and an upper bound of the efficiency will be:

$$\frac{1/e^{2}}{W + 1/e^{2}} = \frac{1}{e^{2}W + 1}$$

We may thus foresee that packet level redundancy will be useful in the case of very large
windows, even if we implement selective acknowledgements, when the product We^2 is larger
than M/N. In fact, we have also to take into account the need to allocate large resequencing
buffers when waiting for selective retransmissions. This will be discussed in the next section.
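
As a worked illustration of the two bounds, assume the window of W = 200 packets used for figure 5 and a loss rate of e = 1%, chosen here only as an example:

$$\frac{1}{eW + 1} = \frac{1}{0.01 \times 200 + 1} \approx 0.33, \qquad \frac{1}{e^{2}W + 1} = \frac{1}{10^{-4} \times 200 + 1} \approx 0.98$$

Under the same assumptions, the condition We^2 > M/N for a 7 + 1 scheme becomes

$$e > \sqrt{\frac{1}{7 \times 200}} \approx 2.7\%$$

so that, with selective acknowledgements, redundancy only starts to pay at much higher loss rates or with much larger windows.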



5 FEC and resequencing delays 

The delivery delay of a packet has three components, initial queuing before transmission, 
transmission and queuing at the recipient side. In order to simplify our demonstration, we 
will assume that the initial queuing delay is nil. We have seen above that the probability of
having fewer than N transmissions for a group of size G is:

$$p(i < N, G) = \left(1 - e^{N-1}\right)^{G}$$

The retransmissions terminate once a copy of the packet has been correctly received. If we
assume that the transport mechanism is very efficient, successive packets will be retransmitted
independently of each other, hence may well arrive at different times. Many applications
require that the packets be reordered before they can be successfully delivered. A packet
which has been correctly received will thus be kept in a resequencing buffer, waiting for the
successful retransmission of all preceding packets.

In order to evaluate the resequencing delay, we will assume that the initial transmissions
are regularly spaced. We will also assume that each retransmission takes the same delay
as the transmission of a full window, W. Figure 6 presents the variation of the delay as
a function of the error rate for basic transmission, without redundancy, and also for three
different forms of redundancy, 7 + 1, 6 + 2 and 5 + 3. The unit of the resequencing delay,
in this figure, is the round-trip delay itself. We have assumed a transmission window of 200
packets, which would correspond to a T1 satellite network, or to a long distance T3 network.
Generally, the effect will be more intense with high values of W.

Figure 6: Sequencing delay as a function of the error rate, for one receiver.

The figure is very expressive. The average resequencing delay soars when the error rate
grows, and so does the variance of this delay. The advantage of using FEC for delay
sensitive applications is thus obvious. The maximum redundancy rate of 5 + 3 results in 
stable delays over a large range of packet loss rates. There are many applications that would 
be happy to pay this overhead, e.g. when user satisfaction matters more than network 
efficiency. 

But this figure calls for another remark. When the resequencing delay increases, the 
memory requirements increase at the same time. TCP requires that the sender keep in 
memory a copy of all unacknowledged packets. The average number of unacknowledged packets
is equal to the sum of a transmission window, W, plus the average number of packets waiting
to be resequenced. This second part is in fact proportional to the average retransmission 
delay. A consequence is that implementing selective retransmission on “long fat networks” is 
by no means free. Similar requirements will also occur at the recipient, which shall keep a 
copy of all packets waiting for resequencing. This may well explain why implementors have 
insisted on the “advisory” nature of selective acknowledgements: under certain conditions, 
the memory requirements of these systems may exceed the number of buffers available.

A system that runs short of buffers cannot benefit fully from the theoretical efficiency of 
selective retransmissions. We are brought back to the imperfect efficiency that we studied 
in section 4. Using FEC in these conditions will stabilize the memory requirements, resulting
in an improvement of the goodput.



6 Choosing the level of redundancy 

All our graphs have shown that there is not one “best” level of redundancy. We already
discussed this in sections 3 and 4 but we limited our analysis to three simple cases, 7 + 1,
6 + 2 and 5 + 3. Figure 7 shows a more systematic exploration of the problem.

Figure 7: Efficiency as a function of the size of the block, for 100 receivers and a packet
loss rate of 1%

We assumed that the sender knew the number of recipients (100) and the loss rate (1%), and we plotted
the efficiency of various levels of redundancy, N + 1, N + 2 or N + 3, as a function of the
number N of “initial” packets in the block. The result is in fact quite obvious. If we can
pick a block that is arbitrarily large, we should always use N + 3 redundancy. If on the
contrary there is a limit to the size of the block, we may have to settle for less, or maybe for no
redundancy at all. This trade-off will be a function of several factors such as:

• the sender’s objective, efficiency or response time, 

• the available memory, that may restrict the size N + M of the redundancy block,

• the size of the multicast group, 

• the observed loss rate, 

• the delay × bandwidth product.
These parameters are likely to vary during the course of a transmission. There is thus a need
to define an adaptive algorithm for choosing the redundancy level. The good news is that
using FEC allows us to easily meter the basic error rate without necessarily suffering from
these errors. The semantics of acknowledgments and other “transmission reports” should be
augmented to contain an “observed loss rate” parameter. This loss rate e can be used to
determine the necessary redundancy, for example with the help of precomputed tables.

Most packet losses in the current Internet are due to congestion. Algorithms such as 
slow-start have been devised to identify the throughput of the network’s bottlenecks and 
to stabilize the transmission rate around this throughput [3]. The use of forward error 
correction should not result in increased congestion, which suggests two implementation
requirements: 

• reduce the congestion window when redundancy is used. 



• use the observed error rate to tune the congestion window.



A station which decides to use redundancy should not suddenly increase its transmission 
rate. The actual transmission window should be a fraction of the value Wc that was 
computed using the slow-start algorithm. Specifically, this fraction should be:

$$\frac{N}{N + M}\, W_c$$



Because FEC will wipe out isolated losses, the station will not notice the first packet losses 
that are usually the warning signs of congestion. It may go on increasing its transmission 
rate until the congestion becomes massive. This undesirable behavior will be avoided if the 
receivers report the observed error rate and if the station reduces its congestion window 
whenever this rate increases. 
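
One possible shape for such an adaptive choice is sketched below. The selection rule is an assumption, not an algorithm given in this paper: it picks, among a few candidate levels, the one that maximizes the section 4 bound applied to the residual loss rate, and then scales the slow-start window W_c by N/(N + M). All function names are hypothetical, and a level with M = 0 stands for no redundancy.

```python
from math import comb
from typing import Sequence, Tuple

Level = Tuple[int, int]  # (N input packets, M redundancy packets)


def residual_loss(e: float, n: int, m: int) -> float:
    """Loss rate left after N + M block correction, for independent losses."""
    others = n + m - 1
    fewer_than_m = sum(
        comb(others, k) * e**k * (1 - e) ** (others - k) for k in range(m)
    )
    return e * (1 - fewer_than_m)


def efficiency_bound(e: float, window: int, n: int, m: int) -> float:
    """Useful share N / (N + M), with a stall of W packets per residual loss."""
    return n / (n + m) / (residual_loss(e, n, m) * window + 1.0)


def pick_level(e_observed: float, window: int,
               levels: Sequence[Level] = ((7, 0), (7, 1), (6, 2), (5, 3))) -> Level:
    """Assumed rule: keep the candidate level with the highest bound."""
    return max(levels, key=lambda lv: efficiency_bound(e_observed, window, *lv))


def fec_window(w_c: int, level: Level) -> int:
    """Transmission window W_c * N / (N + M), rounded down."""
    n, m = level
    return (w_c * n) // (n + m)


level = pick_level(0.01, 200)
window = fec_window(200, level)
```

In practice the result of `pick_level` could be precomputed into a table indexed by the reported loss rate, as suggested above.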



7 Implementing FEC 

This paper showed that packet level FEC can be implemented with benefits when transmit- 
ting reliably towards large groups, when using simple transport protocols and when requiring 
stable delays. Usual objections such as the concern for congestion may be overcome by a
careful tuning of the interaction between forward error correction and congestion control. 

The purpose of our intellectual exercise was to demonstrate the interest of packet level
FEC. We believe that we made this point. We now have to go back to the workbench and
come up with a practical demonstration over the Internet. We have good hopes of completing
this demonstration before August 1996.
Abstract

Multicast (1:N) is now supported by a number of networks and communication 
protocols. A problem in this context is how to provide fully reliable data transmission 
to the receiver group (or specified sub-group) in a heterogeneous environment. To 
achieve this an algorithm is required which is independent of any underlying network 
and multicast scheme. In this paper we present such an algorithm which provides 
fully reliable data transfer over Internet style networks as well as over ATM networks. 
The algorithm was initially inspired by mechanisms proposed for XTP 4.0 and proves 
to work well using XTP 4.0 protocol mechanisms. To evaluate its feasibility and 
performance it was tested with a series of simulations and implemented over MBone. 
The results presented in this paper show that it can provide fully reliable multicast 
to a relatively large number of receivers on top of different networks, protocols and 
protocol architectures. 



1 Introduction 

Multicast communication (i.e. the communication between a single sender and multiple re- 
ceivers) raises various new issues and problems related to data transmission and reliability. 
Different degrees of reliability are required for different applications. Full reliability is for 
instance needed for the transfer of a file to a group of users. In this case all receivers have 
to receive a correct and complete copy of the data. An efficient method to provide this 
kind of reliability while still fully exploiting the advantages of multicast communication 
(e.g. bandwidth saving) is the key issue in the design of reliable multicast protocols. 

In a heterogeneous communication environment a number of different networks are 
connected ranging from LANs (e.g. Ethernet) to Internet style WANs and high-speed 
networks such as ATM. Whereas with Internet style multicast every member of the group
receives every message sent to the group, with point-to-multipoint multicast (as proposed 
for ATM) there is only one designated sender from which the members of the group can 
receive. Any algorithm designed for a heterogeneous environment has to take this into 
account. 

Another issue which is frequently addressed in this context is the problem of slow re- 
ceivers that jeopardise the success of the data transmission because of their inability to 
process data at a given rate. Such receivers can slow down the communication consider- 
ably by sending delayed acknowledgments and requesting frequent retransmissions [1]. A 
flexible method to deal with this problem is required. A new concept in multicast com- 
munication is that of mandatory receivers or a minimum number of receivers that have 
to be present in order to consider the communication successful [2]. It should be possible 
to specify a (sub-)set of receivers that are essential for the success of the communication 
and hence determine its outcome. Other, less important participants can also have less
strict reliability requirements. Hence they might not have to be considered as far as error 
recovery is concerned. 

The Fully Reliable Multicast (FRM) algorithm for end-to-end multicast communication 
introduced in this paper is motivated by our work on multipeer communication services 
in a heterogeneous network environment [3, 4]. This algorithm provides full reliability to
a receiver group (or mandatory sub-group) independently of the underlying network. A 
list-based scheme using positive acknowledgments is employed to keep track of the global 
state of the multicast communication. Retransmission of lost or corrupted data is used 
for error recovery. Conditions can be specified under which slow receivers that cannot
keep pace with the data transmission are ejected to keep the QoS acceptable for the rest
of the group. The main design principles of the algorithm are independence, compatibility
with various protocols and protocol architectures, and efficiency. The FRM algorithm was
initially developed as part of an independent multicast transport service which can amongst 
others operate over XTP 4.0 [3]. The algorithm proved to be ideally suited for this kind of 
environment. To evaluate the algorithm it was implemented on top of MBone and analysed 
using a series of simulations. 

This paper is organised in six sections. Section two discusses reliable multicast com- 
munication including proposals of different methods and algorithms to provide reliable 
multicast in different environments. In section three the FRM algorithm is introduced and 
described in detail. The results of a performance analysis of the algorithm are presented in
section four. Subsequently in section five the implementation of the algorithm over MBone
is introduced and discussed. Finally, section six gives conclusions and outlines some future
research issues.

2 Reliable Multicast Communication 

Multicast communication is a transmission mode which now is supported by a variety 
of local and wide area networks. Many group applications require reliable data transfer 
to all receivers. To provide data transfer reliability in a heterogeneous environment, an
algorithm is necessary which can easily be supported over ATM style networks as well as 
over meshed nets like the Internet. In contrast to the distribution group model supplied 
by IP/Multicast (multicast group abstraction), ATM [5] provides unidirectional point-to- 
multipoint connections where a multicast tree is created with a single sender as the root 
node and receivers as leaf nodes (1-N multicast tree). In this model the receivers do 
not have any knowledge of each other nor is there any communication among them. This 
model of communication can be regarded as a subset of the more general model proposed by
IP/Multicast where everybody can send to the group address and every member is receiving 
messages addressed to the group. The distribution model of IP/Multicast could actually be 
achieved at the ATM layer by using a multicast server [6]. However, the complexity of this 
approach negates the benefits gained by the simplicity of the ATM concept. It also may 
prevent the users from utilising the QoS capabilities provided by ATM networks. Hence 
we concentrate on multicast communication with a single source which in our view is much 
more common than the multisource case. Further, in our research we found that most 
algorithms specifically designed for the IP environment are, in general, not operational in 
an ATM-like network. 

Reliable multicast communication entails a range of new problems affecting end-to-end 
mechanisms. For instance loss and error recovery is different from traditional unicast. An 
entirely new task is group membership management. Main problems in this context are 
receiver group integrity and QoS maintenance. 

In order to achieve full reliability the sender has to keep a copy of a transmitted packet 
until all receivers in the receiver group have positively acknowledged its reception. In 
this case an algorithm that keeps track of the state of all receivers is required. This is 
usually done by using a list-based group management scheme. List-based loss recovery 
algorithms offer reliable multicast by tracking the explicit membership of the receiver set 
and by providing retransmission for any lost or corrupted data. Hence full reliability is 
ensured for the whole receiver group. 

Compared to the unicast case it is much more difficult in multicast to determine the 
success or failure of the communication. Multicast communication introduces a new prob- 
lem related to the integrity of the receiver group. Group integrity refers to conditions 
according to which the data transfer is deemed successful. These conditions are called 
communication integrity conditions. They give the number and/or identity of recipients 
that have to be present and/or to receive the transmitted data correctly. The following 
conditions are considered: key-member (identity) and/or (number), all, quorum¹.

Another important issue in this context is the problem of misbehaving receivers. A 
slow receiver can jeopardise the communication by not responding in time or requesting 
too many retransmissions. In general, if a receiver falls behind the others in the sequence 
space, the sender has to keep up with the pace of this slow receiver. This causes a decrease 
of the end-to-end throughput noticed by all receivers. To avoid this degradation of
performance, a condition can be specified under which a receiver is forced to leave the
receiver group because it jeopardises the communication. This condition is called ejection 
condition. 

The problem of reliability in multicast communication was recently addressed by a 
number of authors [7, 8, 9, 10]. The RAMP algorithm introduced in [7] guarantees reliable 
and orderly delivery to all multicast recipients using two different modes. In the burst 
mode the sender requires acks for each data burst. In the idle mode the sender keeps
sending all the time. In this mode, a negative acknowledgment scheme is used where a 
receiver notifies a sender immediately upon detecting a gap within the received sequence. 
It maintains explicit group membership at the sender and therefore does not scale very 
well for large groups. RAMP uses a simplex model for multicast communications. 

RAMP is the only model that strictly follows the single sender multiple (passive) receiver
paradigm. The algorithm described in [8] (called a framework for scalable reliable
multicast (SRM)) was especially designed for wb (white-board application). Scalability
is achieved by a receiver-based reliability scheme in conjunction with Application Level
Framing (ALF). All receivers are able to retransmit since all data packets (application
data, retransmission requests and all other data) are always distributed to the entire
group. SRM requires that the application is aware of the operational environment and
that it is able to restore the data for retransmission. Further, an IP multicast distribution
model is assumed. SRM is strongly focused on the specific environment it was designed
for and on the kind of application it aims to support. RMTP [9] achieves reliability by
using a packet-based selective repeat retransmission scheme. The ACK-implosion problem
is avoided using a multi-level hierarchical approach, in which the receivers are grouped
into a hierarchy of local regions, with a Designated Receiver (DR) in each local region.
DRs cache received data and respond to retransmission requests of the receivers in their
local regions. It assumes a group distribution model as provided by IP/Multicast (receiver-based
approach). In [10] a hierarchical approach, named Local Group Concept (LGC),
is used as well. Here, global multicast groups are divided into separate subgroups. The
receivers which pertain to the same subgroup should be located in close physical vicinity.
In each subgroup a Local Controller is responsible for collecting status information from
its receiver group and informs the sender with a single composite message to prevent
source implosion. The Local Controller is also responsible for retransmission within the
local group. Data that has to be retransmitted is obtained either from the local receivers
or the original sender. All of the above described algorithms assume that it is possible to
retrieve data for retransmission from the receivers as well as from the sender. To exploit
their full potential it would be necessary to have knowledge about the underlying network.
None of the algorithms introduced so far addresses the problem of misbehaving receivers
and the consequences of this for the multicast transmission.

¹ quorum and all refer to the number of positive replies during connection establishment.

The FRM algorithm described in this paper provides fully reliable data transfer and 
group membership management for multicast communication. It defines proper mech- 
anisms for loss recovery, integrity conditions validation and throughput monitoring and 
maintenance by using ejection conditions. All these mechanisms rely on the use of a list- 
based group management scheme which is akin to the one proposed by XTP 4.0 [11]. 

3 The Fully Reliable Multicast (FRM) Algorithm 

The algorithm is part of a protocol independent transport service and as such uses and
shares information with other components, which is gathered continuously during an ongoing
communication. However, the main design goal was independence, i.e. the algorithm
should be operable with all kinds of end-to-end protocols and services in different environments.



3.1 Algorithm Description 

Initially a message is sent to a group address announcing the set-up of a communication
among members of the group. A receiver group list is created from all positive replies to 
this request. During the communication, the sender gathers status information in regular 
intervals from the receiver group to update this list. To do this the sender periodically 
multicasts a control message to the receiver group to which each receiver immediately 
responds by unicasting an echo message to the sender. To correlate control and echo 
messages, each control message is identified by a value (sync) which is incremented each
time a new control message is sent. Each receiver must copy the sync value from incoming 
control messages into the echo messages returned to the sender. The control message is 
used to solicit the status of the receivers and the echo message is used to transfer this status 
information to the sender. Thus, the current state of the communication is determined by
the sender according to the replies of the receivers. 

The status information carries the reception status of the receiver which is represented 
by two variables called rseq (received sequence number) and dseq (delivered sequence num- 
ber). The rseq contains a value one larger than the largest monotonic sequence number of 
the data packet received without error. This information is used to trigger a retransmission 
if necessary. The dseq value gives the sequence number of the last data message delivered 
to the user plus one. Buffers at the sender are released using the former value. Further 
information exchanged in the status information is used to update state variables required 
for such mechanisms as flow- or rate-control, maximum round-trip delay estimation, etc. 
Figure 1 depicts the basic structure of the FRM algorithm.



[Diagram labels: Sender, Receivers, receiver-group list, global record]

Figure 1: FRM Algorithm Structure.



Each time the sender multicasts a control message it sets a timer (wtimer). The value 
of wtimer determines the time interval during which the sender gathers the replies to 
the control message. When the timer expires the sender determines the lowest rseq (minrseq) 
and the lowest dseq (mindseq) values from all replies. It starts retransmission if needed and releases 
output buffers accordingly. Further, it estimates the maximum round-trip time and updates 
the variables used in the flow- and rate-control mechanisms. Subsequently a new control 
message is sent to start this procedure again. 
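
As a minimal sketch of the processing at timer expiry, the following Python function derives minrseq and mindseq from the collected replies and releases the output buffers below mindseq. The dictionary layout of the replies and of the send buffer is an assumption made for illustration only.

```python
def on_wtimer_expiry(replies, send_buffer):
    """Summarise one round of replies and release acknowledged buffers.

    replies:     receiver id -> {"rseq": int, "dseq": int}  (assumed layout)
    send_buffer: sequence number -> packet payload           (assumed layout)
    """
    minrseq = min(r["rseq"] for r in replies.values())
    mindseq = min(r["dseq"] for r in replies.values())
    # dseq is "last delivered + 1", so every packet below mindseq has been
    # delivered by all receivers and its buffer can be released.
    for seq in [s for s in send_buffer if s < mindseq]:
        del send_buffer[seq]
    return minrseq, mindseq
```

The returned minrseq is the point from which a retransmission would start if one is needed.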

If a response from a receiver does not arrive for some reason during the wtimer interval, 
a synchronising handshake is initiated by the sender. During the synchronising handshake 
data transmission is stopped and the sender requests status information from all receivers. 
The sender tries for a user-specified number of times to get replies from all receivers. If 
a receiver does not reply at all during this period, this receiver is considered dead and is 
dropped from the receiver group. At each new request, the sender increases the period in 
which it collects the replies. 

Because of the effects that the estimated maximum round-trip time has on the algorithm's 
performance, its value has to be estimated carefully. In our model the round-trip 
time for each receiver is calculated by timestamping the control message with the sending 
time. Each receiver must copy this time value from the control messages into the echo 
messages returned to the sender. When the sender processes the echo messages, it calculates 
the round-trip time for each receiver. Note that the round-trip time includes the network 
round-trip time, the waiting time of the echo message in the input buffers and the processing 
time of this message. When the worst case round-trip time (maxRTT) is smaller than 
the current estimate (smaxRTT), the estimate is smoothed, i.e. it 
is lowered gradually according to the update rule below. Otherwise, the current estimate 
assumes this new value. Therefore, the estimation of the maximum round-trip time for 
each control message is done according to the following algorithm: 

if (smaxRTT > maxRTT) 
    smaxRTT := smaxRTT + (maxRTT - smaxRTT)/16; 
else smaxRTT := maxRTT; 
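
The update rule can be sketched in Python as follows. The divisor of the smoothing step and the factor k used for wtimer (defined in the text as k * smaxRTT) are passed as parameters; their default values are illustrative assumptions.

```python
def update_smax_rtt(smax_rtt, max_rtt, divisor=16.0):
    """Return the new smoothed maximum round-trip time.

    A smaller measurement lowers the estimate gradually; a larger (or equal)
    measurement replaces the estimate immediately.
    The default divisor is an assumption, not a value confirmed by the text.
    """
    if smax_rtt > max_rtt:
        return smax_rtt + (max_rtt - smax_rtt) / divisor
    return max_rtt


def wtimer_interval(smax_rtt, k=2.0):
    """wtimer := k * smaxRTT; the default k is an assumed safety factor."""
    return k * smax_rtt
```

The asymmetry is deliberate: the estimate follows increases at once, so slow receivers are accommodated immediately, while decreases are absorbed slowly.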

Network delay and receiver load are the main factors influencing the value of the maxi- 
mum round-trip time. The FRM algorithm is sensitive to changes of this value. In general, 
the maximum round-trip time estimation adapts to the capabilities of the slowest receiver. 
This can cause problems in a heterogeneous environment where its value might be very 
large for some receivers (they might for instance be connected via satellite) compared to 
the rest of the receiver group. This can result in a very poor performance even if the 
number of receivers is relatively small. 

The value of the wtimer has to be large enough to allow all receivers to reply to the 
related control message before it expires. The wtimer is calculated using the following 
equation: 

wtimer := k * smaxRTT, where k > 1 

The multiplication factor k in the above formula provides a safety margin. This 
factor can be changed: for instance, in an unstable environment an increase of this factor 
ensures that the sender does not enter a synchronising handshake too often. The larger the value 
of k, the higher the probability that all replies will be received in the wtimer interval. Hence 
unnecessary synchronising handshakes are avoided. However, a higher value of k has 
a negative impact on the throughput of the multicast communication because the sender's 
transmit buffers cannot be released as frequently as with a smaller value of k. 

The loss recovery algorithm uses a global record which is kept at the sender. This 
record is a data structure containing the current state of the communication in terms of 
data (re)transmission. The global record keeps the sync value of the last control message 
sent, the value one larger than the largest sequence number of any data packet 
(re)transmitted just before sending the control message (lseq), and the sync value of the 
control message during which data packets were previously retransmitted (svdsync). After 
collecting the status information of the receiver set and by using the information contained 
in the global record, the sender can decide whether retransmission is needed while avoiding 
redundant retransmissions. If retransmission is needed, the packet will be multicast to the 
whole group. A go-back-n scheme for retransmission is used in the first version. 
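
The retransmission decision based on the global record can be sketched as follows. This is a minimal Python illustration that mirrors the check used in the algorithm listing of Figure 3; the class and function names are assumptions, and the go-back-n range is returned instead of actually sending packets.

```python
from dataclasses import dataclass


@dataclass
class GlobalRecord:
    sync: int = 0      # sync value of the last control message sent
    lseq: int = 0      # one larger than the highest sequence number (re)transmitted
    svdsync: int = 0   # sync value saved when data was last retransmitted


def retransmission_range(rec, minrseq):
    """Return the sequence numbers to retransmit (go-back-n), or None."""
    if minrseq < rec.lseq and rec.sync > rec.svdsync:
        # Retransmission is not redundant: remember the round whose replies
        # cannot yet reflect the packets retransmitted now.
        rec.svdsync = rec.sync + 1
        return range(minrseq, rec.lseq)
    return None


def next_round(rec):
    """Advance to the next control message."""
    rec.sync += 1
```

Because svdsync is set to sync + 1, the round that immediately follows a retransmission never triggers a second retransmission of the same data.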

3.2 Group Integrity 

The FRM algorithm provides mechanisms to check communication integrity conditions. A 
strong notion of group integrity is provided by the FRM algorithm since it ensures that 
either all mandatory recipients have received the data correctly or that it is still available. 
Hence, integrity conditions are only breached when a mandatory receiver is not present 
any longer. A situation where the group integrity is momentarily not valid can only occur 
between two control messages. 

Integrity conditions are validated every time the sender discovers that a receiver is not 
present any more. This may be the case when a receiver decides to leave or is forced to leave 
the receiver group. Further, the sender also validates the integrity conditions at the end of 
each integrity interval which is determined by the current wtimer value. Every time the 
sender detects that a mandatory receiver is not present any more, the sender will release 
the communication since the integrity conditions are not satisfied any longer. 

3.3 Ejection 

In the FRM algorithm the user may specify a minimum degree of end-to-end throughput 
in order to consider the communication successful. An ejection condition based upon the 
release rate throughput at the sender (relthr) is defined for this purpose. This parameter 
indicates the mean number of buffers released during a certain time interval (ejectcdtintv). 
Note that buffers are only released when all receivers have correctly received the associated 
messages. Thus, the user specifies a minimum release throughput (minrelthr) as ejection 
condition. When the calculated release throughput falls below the minrelthr, the ejection 
condition is satisfied. The slowest receiver(s) (if any can be determined) is then forced 
to leave the receiver group. Network problems which affect all receivers or a sub-set of 
receivers may be the cause for the degradation in the end-to-end throughput as well as 
problems in the receivers themselves. Hence, the main issue is how exactly to determine the 
cause of the ejection situation. Thus, before ejecting any receiver(s), it has to be ensured 
that the offending receiver(s) is the sole reason for the poor performance. 

To establish which receiver is causing the problem, additional information is kept in the 
receiver group list. The slowest receiver is the one which acknowledges the lowest sequence 
number (dseq) for a certain consecutive number of times. In addition, the sender calculates 
the mean release throughput of each receiver (mrelthr). This value is calculated based on 
the replies to each control message. It gives the mean number of buffers that would be 
released if only this receiver took part in the communication. When the ejection condition 
is met, the sender has to decide which receiver(s) have to leave the communication. To 
do this it checks if a slowest receiver exists. In this case, the sender ejects this receiver. 
Otherwise, it compares the mrelthr of each receiver against the minrelthr. All receivers 
that have a mrelthr smaller than the minrelthr are forced to leave the communication. 
This strategy ensures that the sender only forces the offending receiver(s) to leave. 
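
The selection of the receivers to eject can be sketched in Python as follows. The function names, the per-round history list and the streak parameter are illustrative assumptions; only the decision order (slowest receiver first, otherwise comparison of mrelthr against minrelthr) follows the description above.

```python
def find_slowest(lowest_per_round, streak):
    """Return the receiver that reported the lowest dseq in each of the last
    `streak` rounds, or None if no such receiver exists.

    lowest_per_round: one receiver id per control round (None for a tie).
    """
    if streak < 1 or len(lowest_per_round) < streak:
        return None
    tail = lowest_per_round[-streak:]
    if tail[0] is not None and all(r == tail[0] for r in tail):
        return tail[0]
    return None


def receivers_to_eject(relthr, minrelthr, slowest, mrelthr):
    """Decide which receivers have to leave.

    relthr:  measured release throughput at the sender
    slowest: result of find_slowest (may be None)
    mrelthr: receiver id -> mean release throughput of that receiver
    """
    if relthr >= minrelthr:
        return []                      # ejection condition not satisfied
    if slowest is not None:
        return [slowest]
    return sorted(r for r, thr in mrelthr.items() if thr < minrelthr)
```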

Figure 2 shows the influence of the ejection condition when applied to multicast 
communication using the FRM algorithm. In this example we consider ten receivers, of 
which two cannot keep up with the rest and fall behind the minimum release throughput. 
The minimum release throughput specified by the sender is 180 bytes/s. As soon as the 
mean throughput drops below this value, the sender determines the slow receivers and ejects 
them. This action shows the anticipated effect, i.e. the communication recovers and returns 
to the initial, better performance. Hence it prevents slow receivers from jeopardising the 
communication by slowing it down beyond a threshold agreed in advance. In comparison, 
the curve without ejection condition shows that the sender synchronises with the slow 
receivers. 

3.4 The algorithm 

For each echo message received the sender updates the receiver-group list, and whenever 
the wtimer expires the sender runs the FRM algorithm. A complete description of this 
algorithm including all its procedures is presented in Figure 3. 

Figure 2: Ejection Condition Performance. 



4 Performance Analysis 

To evaluate the FRM algorithm and the influence of different scenarios on its efficiency 
and performance, it was analysed using a test suite of simulations. This section introduces 
and discusses the simulation results. 

The simulator package used is QNAP 2 [12]. For the simulation a multicast communication 
model with a variable number of receivers (N) was developed. Both sender and 
receivers have a restricted buffer capacity of B packets. A constant data packet size S is 
assumed. The processing times for control (C), data (D) and echo (E) packets are as follows: 
C = E = x and D = 3 * x. The execution time of the loss recovery algorithm is x. The 
underlying network is ATM, characterised by a transfer rate of 155.5 Mb/s and a packet 
error rate of L. The distance between the sender and the i-th receiver is determined by 
a propagation delay of Di, where Di is uniformly distributed in the interval [Dmin, Dmax]. 
The sender generates data traffic with the constant time span SData between two subsequent 
packets, and the receiver user consumes data with a constant rate of TSData. 
Table 1 summarises all simulation parameters described above. 

The main parameter used to evaluate the performance is the mean response time for 
data packets (MRT). The MRT is the mean time elapsed between the submission of a data 
packet for transmission by the sender application and its delivery at the receiver side. The 
loss recovery algorithm is analysed according to the results obtained for this performance 
parameter. 

The algorithm is evaluated using the MRT as a function of N. In these simulations 
the propagation delay is uniformly distributed in the interval [10, 15] ms. The processing times for 
the packets are obtained using x = 0.5 ms. In addition, the following parameter values are 
assumed: B = 32 packets, S = 2000 bytes, SData = 10 ms, L = 10“^, TSData = 0.5 ms. 

Figure 4 shows the scalability of the algorithm. The x-axis shows the number of 
receivers and the y-axis shows the mean response time. The FRM algorithm presents a steady 
increase of the MRT. After a certain number of receivers the sender is no longer able to 
handle them efficiently. Therefore, the algorithm does not scale well when the number of 







if (all receivers replied) { 
    update_variables;              /* mindseq, minrseq, maxRTT, ... */ 
    calculate_smaxRTT; 
    release_output_buffers;        /* until mindseq */ 
    if (ejectcdtintv) { 
        check_ejection_condition; 
        if (eject_cdt_satisfied) leave_offending_receiver(s); 
    } 
    check_comm_integrity_conditions; 
    if (!comm_int_cdt_satisfied) release_communication; 
    if (minrseq < lseq) { 
        if (sync > svdsync) { 
            /* retransmission is not redundant */ 
            retransmit_buffers;    /* from minrseq */ 
            svdsync := sync + 1; 
        } 
    } 
    sync := sync + 1; 
    send_control_message; 
    set wtimer;                    /* k * smaxRTT */ 
} 
else synchronising handshake; 



Figure 3: FRM Algorithm Description. 



receivers reaches a certain threshold (in our example 150 receivers). Further optimisations 
to accommodate a larger number of receivers are currently being investigated. 

4.1 Influences and Interfering Factors 

The algorithm is influenced by different external factors related to the heterogeneity of 
the network and receiver group, the topology, error rate, etc. However, the main factors 
influencing the mean response time with our list-based FRM algorithm are management 
overhead and network error rate in conjunction with buffer restrictions in the sender. The 
management overhead, usually referred to as source implosion effect, leads to a steady 
increase of the MRT because with every new receiver the sender has to process more 
control packets and to manage a growing number of receivers. During this time it is not 
able to send any data. In the worst case the sender is only dealing with management tasks 
and is thus hardly able to send any application data. 

However, buffer restrictions also considerably influence the performance of the FRM 
algorithm. Figure 5 shows that with less buffer space in the sender the MRT increases 
suddenly and very rapidly. With an increasing number of receivers the maximum round-trip 
time increases because the increased management overhead adds to the estimated 
maximum round-trip time. The sender can only send as long as there is enough buffer 
space left. Once all buffers are full the transmission has to be stopped until all receivers 
have acknowledged the correct reception of the data. With an increased round-trip time 
the frequency with which buffers are freed becomes lower. Hence, transmission of data has 
to stop more often because of insufficient buffer space. This in turn has a negative effect 
on the encountered MRT. 

Parameter   Meaning 
N           number of receivers 
B           buffer capacity in number of packets 
S           data packet size 
C           control packet processing time (= x) 
D           data packet processing time (= 3 * x) 
E           echo packet processing time (= x) 
x           constant used to determine processing times 
L           packet error rate 
Di          propagation delay between the sender and the i-th receiver 
Dmin        minimum propagation delay 
Dmax        maximum propagation delay 
SData       constant time between two data packets generated by the sender 
TSData      constant service rate of data packets at the receivers 
MRT         mean response time 

Table 1: Simulation parameters. 



Data loss or corruption has the same effect on the performance of the algorithm, since 
retransmission prevents buffers in the sender from being released. Buffers can be released 
only when all receivers have acknowledged the correct reception of the retransmitted data. 
With an increasing number of receivers the failure probability increases as well, leading to 
a large number of retransmissions. 

Our results indicate that to achieve a better performance while providing full reliability 
it is necessary to optimise buffer utilisation and to remedy the source implosion effect. 
The former requires a more frequent release of buffers, i.e. the time in which buffers can 
be released should be decreased. This would also result in reduced sensitivity towards 
the estimated maximum round-trip time. The latter can be realised by distributing echo 
messages over time. Further optimisation might allow a large number of receivers with less 
overhead, but it will not be possible to achieve full reliability without any extra costs. 

5 Implementation over MBone 

To show its applicability in an Internet environment and to validate the general na- 
ture of our FRM algorithm, we have also implemented it using the multicast backbone 
(MBone) delivery service. The implementation resides on top of the User Datagram Pro- 
tocol (UDP) [13]. Apart from development reasons the main motivation for this solution 
was compatibility with other Internet protocols and the current Internet philosophy. In 
the same way RTP [14] provides support for real-time communication over UDP, a pro- 
tocol or a set of protocols can be used to provide the required reliability for multicast 
communication over UDP. The FRM algorithm could be part of such a protocol. 

The sender transmits all multicast packets (DATA and CNTL messages) to a Class 
D multicast address determined before the communication is set up. The receivers reply 
(ECHO messages) to the unicast address of the sender. Therefore, all packets originating 
at the sender are transmitted via a 1 to N multicast tree to the receiver group, and the 
replies from the receivers are sent using the conventional unicast delivery service. Only the 
designated sender sends to the multicast address. No other sender may use this address 
for this kind of fully reliable multicast communication. 

Figure 4: FRM Algorithm Scalability. 

Before transmitting data the multicast sender initialises the communication by sending 
an INIT packet to the multicast group address. This packet informs the receivers about 
the sender's name and unicast address, the minimum sustainable release throughput and 
the packet size. Upon reception of this packet each receiver willing to participate replies 
with an INIT packet informing the sender of its globally unique identifier and its buffer 
size. In this so-called initialisation phase the sender estimates the maximum round-trip 
time for the first time and initialises its variables accordingly. 

To test the FRM algorithm over the Internet and to conduct experiments on 
a wider scale, we use a user-level process at the sender which reads a local file and performs a 
multicast file transfer. At the moment we are conducting a number of experiments 
between the Laboratoire MASI in Paris, Lancaster University in the UK, and Sophia-Antipolis 
in Nice to measure the implementation's performance. 

6 Conclusion 

Multicast (1:N) communication is an efficient way to transmit the same data to a group of 
receivers. A number of networks (such as Ethernet, token ring, ATM) and protocols (e.g. 
IP, XTP 4.0) support multicast to some extent. A major problem in this context is how to 
deal with data transfer reliability in a heterogeneous environment where different multicast 
schemes might be employed. A way to support fully reliable data transmission over a number 
of different networks while still utilising the advantages of multicast communication is 
required. 

The FRM algorithm introduced in this paper provides fully reliable data transmission 
to a group (or specified sub-group) independently of the underlying protocols or network 
structure. A list-based group management scheme, positive acknowledgments and retransmission 
for error recovery are employed. Conditions can be specified under which slow receivers 
that jeopardise the success of the communication are ejected. This guarantees that 
the QoS does not drop below an acceptable value specified in advance. 

Figure 5: MRT with different buffer sizes. 

Major design goals of our service were protocol and network independence, compati- 
bility and efficiency. To evaluate our algorithm and to assess its performance it was tested 
in different environments using real protocols and simulation tools. It has proved to work 
with XTP 4.0 mechanisms as well as on top of a UDP/IP protocol architecture. Our experience 
with the algorithm shows that it provides an efficient and effective way to ensure 
fully reliable multicast in a heterogeneous environment. It is also flexible and can even be 
used when different degrees of reliability are needed for a sub-set of the receiver group. 

The maximum sustainable number of receivers is mainly restricted by two factors, 
namely limited buffer in the sender and the source implosion effect. Source implosion refers 
to a situation where the sender is too busy processing acknowledgments from receivers and 
thus is unable to send new data. To remedy these problems we are currently working on 
a scheme where buffers are released more frequently and source implosion is avoided by 
distributing receiver acknowledgments over time. 



Abstract 

The MBONE provides an infrastructure for multicast communication on the internet based 
on IP. Several proposals have been made to create reliable multicast transport on top of the 
MBONE structure. Nearly all research on reliable multicast protocols for the MBONE focuses 
on ARQ error recovery. The counterpart of ARQ, FEC, does not guarantee 100% reliability 
but increases the reliability. The aim of this paper is to determine the best place in a multicast 
tree for the use of FEC. We develop a framework that allows us to model analytically the 
impact of FEC on the average number of transmissions necessary to transmit a packet to 
all members of the multicast group. We look at different multicast tree topologies, different 
degrees of correlated loss in the multicast tree, different multicast group sizes and show the 
effect of FEC in terms of transmissions needed to achieve 100% reliability. We find that the 
shared part of the multicast tree is not always the best part to employ FEC. 

Keywords 

Reliable Multicast, Forward Error Correction, Topology, Scalability 



1 INTRODUCTION 

Reliable multicast communication must handle the problem of transporting information from 
a sender to a number r of receivers via a network that is subject to packet loss. Two well- 
known methods exist to deal with loss: ARQ and FEC. 

Adapting known ARQ schemes from point-to-point communication to 1 : r multicast communication 
is relatively straightforward, and analytical results from early research [TM87], 
[Tow85] showed that ARQ with selective retransmission yields the best throughput performance. 

After the first installation of the MBONE [Dee88] a few years ago, a growing part of the 
Internet routers supports IP multicast. With the availability of the MBONE, reliable multicast 
protocols have been implemented for the WAN environment and results have been reported 
in [RA95], [FJM+95], [YGS95]. The WAN environment provides a new challenge: reliable 
multicast communication has to face the scalability problem that arises for a large number 
of receivers. The research community proposed two directions to deal with the scalability 
problem: 

• timer approaches [Gro96], [FJM+95] that introduce a certain asynchrony among receivers 
to protect the source from a NAK-implosion and 

• cluster approaches [YGS95], [LP96], [Hof96] that achieve the same goal by introducing a 
hierarchy into the multicast tree. 



In this paper we do not follow one of these ARQ approaches, but use Forward Error 
Correction (FEC) to decrease the packet loss experienced in the multicast tree. FEC adds 
redundant information to the transmitted data, allowing a receiver to recover the original 
information even under partial loss, which results in a lower loss experienced by the receiver. 
FEC in multicast connections [Car95] can have a big 
effect, since the global loss will increase with the number r of receivers. We evaluate the 
effect of the group size on the success of FEC. 

We look at different WAN-LAN scenarios and quantify the effect achieved by FEC in terms 
of the average number of transmissions. We also evaluate the effect that is achieved 
when FEC is applied either just on the LAN part or just on the WAN part of the tree, in order 
to determine where FEC performs best. A counterpart for our research in FEC is [ZS91], 
where ARQ in a LAN-WAN-LAN environment is examined, with the conclusion that 
edge-to-edge error control performs very poorly and requires a lot of buffer. The advantage 
of FEC in an edge-to-edge case is that nearly no buffer is required and just a small encoding 
(decoding) delay is experienced. In [McA90] a VLSI coder and decoder for a burst erasure 
code is shown that can code at a speed of 1 Gigabit per second. The end-to-end semantic of 
the ARQ scheme on top of the FEC codec is not influenced. When we refer in the following to 
a LAN we mean a subnetwork connected to a WAN and refer to it also as a private network. 

We demonstrate also that the effect of FEC on the multicast tree is strongly dependent on 
shared links. Nearly all the research on the performance of reliable multicast communication 
[TM87], [Tow85], [Den93], [ST87], [PTK94] assumes multicast trees with links where the loss 
over any link affects only a single receiver, referred to as independent loss. The assumption 
that the paths to different receivers have no shared links is typically not true for trees 
constructed by multicast routing algorithms such as CBT [BFC93] or PIM [DEF+96]. We are 
evaluating FEC for different degrees of dependent loss by varying the number of shared links 
on the path from the source to the receivers. 

Multicast routing algorithms establish the routes for a multicast group, that form a tree 
rooted at the sender. When we speak in the following of a tree, then we mean a multicast 
tree and vice versa. 

The rest of the paper is organized as follows. First our performance measure is introduced 
and we define the topologies used to model the multicast trees. In the section Results we 
show the effect of FEC when applied on the whole multicast tree and we compare the cases 
where FEC is applied just on the WAN part or just on the LAN part of the multicast tree. 
Finally we draw the conclusions. 



2 PERFORMANCE MEASURE 



We look at the number of transmissions that we need to transmit a single packet to all 
receivers. We believe that the expected number of transmissions captures the global packet 
loss in the tree and is the best way to evaluate the effect of FEC in a multicast tree. 

We model FEC in a simple way by assuming that FEC will lower the link loss probability q 
by two orders of magnitude.* To evaluate the effect of FEC, we define the gain by employing 
FEC in terms of transmissions. 

Let $T$ be the average number of transmissions without FEC and $T_{FEC}$ the average number 
of transmissions with FEC. Then, we can define: 

$$\mathrm{gain}_{FEC} = \frac{T}{T_{FEC}} \qquad (1)$$ 



The gain performance measure depends on the number of receivers r, the link loss proba- 
bility q and the topology of the multicast tree. To evaluate reliable multicast transmission, 
we calculate the expected number of transmissions needed to deliver a packet to all receivers. 
In [BMT94], the expected number of transmissions is given for loss at nodes in the multicast 
tree. We consider loss on a link, which is more appropriate due to two facts: loss at the source 
node is unlikely and link loss can be associated with loss in output buffers in routers. How 
to calculate the number of expected transmissions based on link loss and the multicast tree 
topology is given in the following. 

Given that the packet is always successfully received by the predecessor (father) of node n, 
let $T(n)$ be the number of transmissions of a packet until it is received by node n and all 
receivers in the subtree rooted at n, and let $F_n(i) = P(T(n) \le i)$ be the respective CDF. We 
assume a constant link loss probability q on the link leading to a node n and calculate 
the expected number of transmissions $E(T(S))$ at the source S, using the CDF $F_S(i)$, as 

$$E(T(S)) = \sum_{i=0}^{\infty} \left(1 - F_S(i)\right) \qquad (2)$$ 



We can calculate $F_S(i)$ in a recursive fashion, starting at the leaves of the multicast tree. To 
obtain $F_n(i)$ for an arbitrary node n we must distinguish three cases. 

• Node n is a leaf l. Then the probability that fewer than i + 1 transmissions are needed to 
deliver the packet to n over the link leading to n is: 

$$F_l(i) = P(T(l) \le i) = 1 - q^i \qquad (3)$$ 

• Node n is an internal node. Then there exists one link leading to n and at least one child c. If 
there are i attempts to deliver the packet over the link leading to node n and it is lost 
exactly u times, which happens with probability $\binom{i}{u} q^u (1-q)^{i-u}$, then a copy of the 
packet is forwarded i - u times on every outgoing link to every child. The conditional 
probability that all children of n and the nodes in the subtrees rooted at the children receive 
the packet during these i - u times is $\prod_{c \in child(n)} F_c(i-u)$. So we get $F_n(i)$ by summing 
over all possible values of u: 

$$F_n(i) = \sum_{u=0}^{i-1} \binom{i}{u} q^u (1-q)^{i-u} \prod_{c \in child(n)} F_c(i-u) \qquad (4)$$ 

• Node n is the source S. Then there is no link leading to S and consequently only the loss 
experienced by its children c has to be considered: 

$$F_S(i) = \prod_{c \in child(S)} F_c(i) \qquad (5)$$ 

* Applying FEC end-to-end on a path with l links, instead of applying it on every link, does not change 
our analytical results, since for a small link loss probability q the approximation $1 - (1-q)^l \approx q\,l$ holds. 
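
The recursion (3), (4), (5) and the expectation (2) can be evaluated numerically. The following Python sketch is an illustration only: the tree encoding (a node is the list of its children, a leaf is the empty list), the function names and the truncation tolerance are assumptions, and the same loss probability q is used on every link.

```python
from math import comb, prod


def cdf(node, i, q, is_source=False):
    """F_n(i) = P(T(n) <= i) for a node given as the list of its children."""
    if is_source:
        return prod(cdf(c, i, q) for c in node)            # equation (5)
    if not node:
        return 1.0 - q ** i                                # equation (3), leaf
    return sum(                                            # equation (4)
        comb(i, u) * q ** u * (1.0 - q) ** (i - u)
        * prod(cdf(c, i - u, q) for c in node)
        for u in range(i)
    )


def expected_transmissions(tree, q, tol=1e-12, max_i=10_000):
    """E(T(S)) from equation (2); the sum is truncated once terms fall below tol."""
    total = 0.0
    for i in range(max_i):
        term = 1.0 - cdf(tree, i, q, is_source=True)
        total += term
        if i > 0 and term < tol:
            break
    return total
```

For a single receiver attached directly to the source, `expected_transmissions([[]], q)` reduces to the geometric series 1/(1 - q). The plain recursion recomputes subtrees and is only meant for small trees.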



3 MODEL OF MULTICAST TREE AND LOSS 

For our analysis, we define a generic multicast tree, referred to as tree 3 (see figure 1), that 
allows us to vary the essential parameters of a multicast tree: link share, path length, number of 
receivers, and the link loss probability. In this multicast tree, we have r receivers, each at the 
same distance from the source. The number of links that each receiver shares with the other 
receivers is modifiable. The tree height is fixed to h = 30 and the number of unshared links 
on the path from the source to every receiver is k = 1 ... 30, yielding h - k = 0 ... 29 shared 
links. We now have two link loss probabilities: $q_1$ for a shared link and $q_2$ for the other links. 
This will help us to evaluate the loss in heterogeneous environments, where e.g. the shared 
links are in a WAN, and the other links are in a LAN. For tree 3 the loss probability on the 
path from the source to every single receiver is the same. 



Figure 1: The multicast tree 3. Scalable in h, k and r. 
Legend: S: source; R: receiver; r: number of receivers; h: distance (in hops) from the 
source to a receiver; $q_1$: WAN link loss probability; $q_2$: LAN link loss probability. 

The expected number of transmissions in tree 3 can be evaluated according to (3), (4), 
(5) and (2). The calculation can be simplified by aggregating the h - k shared links into one 
shared path sP with a packet loss probability of $q_{sP} = 1 - (1 - q_1)^{h-k}$. The k unshared links 
on the path to every receiver can be aggregated into one branch B with the loss probability 
$q_B = 1 - (1 - q_2)^k$, see figure 2. 




Figure 2: The aggregated multicast tree 3. 
Legend: S: source; R: receiver; r: number of receivers; h: distance (in hops) from the 
source to a receiver; $q_{sP}$: packet loss probability on the shared path; $q_B$: packet loss 
probability on a branch. 



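This aggregation yields: 

For the single child c of the source using (4) we get: 

$$F_c^{Tree3}(i) = \sum_{u=0}^{i-1} \binom{i}{u} (q_{sP})^u (1 - q_{sP})^{i-u} \left(1 - q_B^{\,i-u}\right)^r \qquad (6)$$ 

and for the source itself: 

$$F_S^{Tree3}(i) = F_c^{Tree3}(i) \qquad (7)$$ 

Having the expectation: 

$$E^{Tree3}(T(S)) = \sum_{i=0}^{\infty} \left(1 - F_S^{Tree3}(i)\right) \qquad (8)$$ 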
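
As a numerical illustration of (6) to (8) and of the gain defined in (1), the following Python sketch evaluates the expected number of transmissions in tree 3 with and without FEC. The function names, the truncation tolerance and the parameter `factor` (modelling the reduction of the link loss probability by two orders of magnitude) are assumptions made for this sketch.

```python
from math import comb


def tree3_expected(r, h, k, q1, q2, tol=1e-12, max_i=100_000):
    """E(T(S)) for tree 3 via the aggregated shared path and branches."""
    q_sp = 1.0 - (1.0 - q1) ** (h - k)      # loss on the aggregated shared path
    q_b = 1.0 - (1.0 - q2) ** k             # loss on one aggregated branch
    total = 0.0
    for i in range(max_i):
        f = sum(                            # equations (6) and (7)
            comb(i, u) * q_sp ** u * (1.0 - q_sp) ** (i - u)
            * (1.0 - q_b ** (i - u)) ** r
            for u in range(i)
        )
        term = 1.0 - f                      # summand of equation (8)
        total += term
        if i > 0 and term < tol:
            break
    return total


def fec_gain(r, h, k, q1, q2, factor=0.01):
    """gain_FEC = T / T_FEC, with FEC modelled as scaling q1 and q2 by `factor`."""
    t_plain = tree3_expected(r, h, k, q1, q2)
    t_fec = tree3_expected(r, h, k, q1 * factor, q2 * factor)
    return t_plain / t_fec
```

For example, `fec_gain(100, 30, 15, 0.01, 0.01)` compares the two cases for 100 receivers with half of the path shared; the value depends on the assumed loss probabilities.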

Note that for a shared path of length h - k = 0, we have the case of r children at the source,
which is different from the above case, described by (7), where we have just one child of
the source. The above expressions for $F_c^{Tree3}$ and $F_S^{Tree3}$ are nevertheless valid and
yield the same result for $E^{Tree3}(T(S))$ as the computation with r children at the source
according to (3), (4) and (5). This will be shown in the following. For h - k = 0 it is
$q_{SP} = 0$ and $q_B = 1 - (1 - q_2)^k$; using the expressions (6) and (7) for one child at the
source yields for $F_S$:

$$F_S^{Tree3}(i) = F_c^{Tree3}(i) = \left(1 - q_B^{\,i}\right)^{r} \qquad (9)$$

This is then the same as having r children at the source, as expressed in:

$$F_S^{Tree3}(i) = \prod_{c=1}^{r} F_c^{Tree3}(i) = \left(F_c^{Tree3}(i)\right)^{r} = \left(1 - q_B^{\,i}\right)^{r} \qquad (10)$$

where $F_c^{Tree3}(i) = 1 - q_B^{\,i}$ now refers to one of the r children reached over a single
branch.

The expected number of transmissions can therefore be calculated in any case with the
expressions in (6) and (7).
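The two steps behind this equivalence can be written out explicitly. This is only a reading
of the expressions above via the binomial theorem, with the convention $0^0 = 1$; it adds
nothing to the model. For a vanishing shared path only the term in which no copy is lost on
the shared path survives, and for a receiver reached over one branch the probability that at
least one of i copies arrives is $1 - q_B^{\,i}$:

```latex
\begin{align*}
q_{SP} = 0:\quad
  \sum_{u=0}^{i-1} \binom{i}{u}\, q_{SP}^{\,u}\, (1 - q_{SP})^{i-u}
  \bigl(1 - q_B^{\,i-u}\bigr)^{r}
  &= \bigl(1 - q_B^{\,i}\bigr)^{r}, \\
\sum_{j=0}^{i-1} \binom{i}{j}\, q_B^{\,j}\, (1 - q_B)^{i-j}
  &= \bigl(q_B + (1 - q_B)\bigr)^{i} - q_B^{\,i}
   = 1 - q_B^{\,i}.
\end{align*}
```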

A case to consider is big multicast groups with multiple receivers in several LANs. This
case is modeled by multicast tree 4, see figure 3. Multicast tree 4 is based on tree 3 and
also has a height of h = 30. Tree 3 now models the part of the multicast tree that resides in
the WAN. The place of the receivers in tree 3 is now taken by LANs, introducing another
hierarchy level. At the entry point into a LAN the tree splits another time into disjoint
branches consisting of l = 5 links with a low link loss probability $q_3$. The part of the
multicast tree in the WAN therefore has a height of h - l = 25. As in tree 3, there exists one
common path of length h - l - k = 25 - k hops of links with a link loss probability $q_1$ that
splits into branches of length k = 1 ... 25 consisting of links with link loss probability $q_2$,
where $q_2 = q_1$. At the entry into the LAN each of these branches splits into
$r_{LAN}$ = 5, 25, 100 unshared branches of length l = 5, one for every receiver. We will now
have r = 100 ... 3000 receivers in total, since we consider big multicast groups. The number
of LANs is thereby given as $r / r_{LAN}$.

Figure 3: The multicast tree 4. Scalable in h, k, l, r and rLAN.
(S: source; R: receiver; r: total number of receivers; rLAN: number of receivers in a LAN;
h: distance in hops from the source to a receiver; k: length in hops of a WAN branch;
l: length in hops of a LAN branch; q1: link loss probability on the WAN shared path;
q2: link loss probability on a WAN branch; q3: link loss probability on a LAN branch.)



The expected number of transmissions in tree 4 is given in a similar way as the expected
number of transmissions in tree 3. Let $q_{WANSP} = 1 - (1 - q_1)^{h-l-k}$ be the loss probability
on the common path in the WAN, $q_{WANB} = 1 - (1 - q_2)^{k}$ the loss probability on a branch
in the WAN and $q_{LANB} = 1 - (1 - q_3)^{l}$ the loss probability on the branch in the LAN, then
the number of transmissions in tree 4 is given according to (3), (4), (5) and (2) by:

$$F_S^{Tree4}(i) = \sum_{u=0}^{i-1} \binom{i}{u} (q_{WANSP})^{u} (1 - q_{WANSP})^{i-u} \prod_{c=1}^{r / r_{LAN}} F_c^{Tree4}(i-u) \qquad (11)$$

$$F_c^{Tree4}(i-u) = \sum_{j=0}^{i-u-1} \binom{i-u}{j} (q_{WANB})^{j} (1 - q_{WANB})^{i-u-j} \left(1 - q_{LANB}^{\,i-u-j}\right)^{r_{LAN}} \qquad (12)$$
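Expressions (11) and (12) can be evaluated numerically in the same way as for tree 3. The
sketch below is an illustration under stated assumptions: the function names, the truncation
tolerance `tol` and the cap `max_i` are hypothetical, r is assumed to be a multiple of
`r_lan`, and the expectation is taken as the sum over i of 1 - F_S(i), as in (8).

```python
from math import comb


def path_loss(q, links):
    """Loss probability of `links` consecutive links that each lose with probability q."""
    return 1.0 - (1.0 - q) ** links


def f_branch_tree4(n, q_wanb, q_lanb, r_lan):
    """Expression (12), evaluated for n = i - u copies entering one WAN branch."""
    return sum(comb(n, j) * q_wanb ** j * (1.0 - q_wanb) ** (n - j)
               * (1.0 - q_lanb ** (n - j)) ** r_lan
               for j in range(n))


def f_source_tree4(i, q_wansp, q_wanb, q_lanb, lans, r_lan):
    """Expression (11): P(T(S) <= i) for tree 4 with `lans` = r / r_lan LANs."""
    return sum(comb(i, u) * q_wansp ** u * (1.0 - q_wansp) ** (i - u)
               * f_branch_tree4(i - u, q_wanb, q_lanb, r_lan) ** lans
               for u in range(i))


def expected_transmissions_tree4(h, k, l, r, r_lan, q1, q2, q3,
                                 tol=1e-12, max_i=300):
    """Expected number of transmissions in tree 4, truncated numerically."""
    if r % r_lan != 0:
        raise ValueError("r must be a multiple of r_lan")
    q_wansp = path_loss(q1, h - l - k)  # common WAN path
    q_wanb = path_loss(q2, k)           # WAN branch
    q_lanb = path_loss(q3, l)           # LAN branch
    expectation = 0.0
    for i in range(max_i):
        tail = 1.0 - f_source_tree4(i, q_wansp, q_wanb, q_lanb, r // r_lan, r_lan)
        expectation += tail
        if i > 0 and tail < tol:
            break
    return expectation
```

With one receiver, one LAN and no common path the tree degenerates to a chain of two
aggregated links, for which the expected number of transmissions is 1/((1 - q)(1 - q)).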

4 RESULTS 

Our analysis of the FEC gain consists of three parts. 

The first part shows for small and for big multicast groups the effect of FEC when applied 
from the source to every receiver. 

The second part shows the effect of FEC for a small multicast group and compares the
FEC gain when FEC is applied either on the LAN part of the tree or on the WAN part.

The last part shows the effect of partial FEC for big multicast groups with multiple
receivers in several LANs.

4.1 FEC on the whole multicast tree 

Using tree 3 we investigated the FEC gain for different numbers of shared links and
different numbers of receivers, as shown in figure 4 and figure 5. We assume a homogeneous
link loss probability in tree 3 of $q = 10^{-3}$ without FEC and a link loss probability of
$q = 10^{-5}$ with FEC.



Figure 4: The gain of applying FEC to a multicast connection for a small multicast group
and different degrees of link share (q = 0.001; with FEC: q = 0.00001).

Figure 5: The gain of applying FEC to a multicast connection for a big multicast group
and different degrees of link share (q = 0.001; with FEC: q = 0.00001).



Figure 4 shows the FEC gain for a small multicast group. FEC yields an expected number
of transmissions that is close to one, since there are few links in the multicast tree and the
link loss probability $q = 10^{-5}$ is very low compared to $q = 10^{-3}$ without FEC. The shape
of the FEC gain in figure 4 is therefore determined by the expected number of transmissions
without FEC. The FEC gain increases with the number of receivers because every new
receiver adds k new links, and a packet is subject to loss on every link. The FEC gain
decreases with the number of shared links because sharing reduces the number of links in
the tree. Two major conclusions can be drawn: FEC is more efficient for a high number of
receivers, and FEC is more efficient for independent loss. In the case of independent packet
loss (no shared links) and 30 receivers, 60% more transmissions have to be made on average
to deliver a packet to all receivers than in the case where FEC is used. Having a high number
of shared links yields nearly no gain from applying FEC.
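The 60% figure can be checked numerically for the case without shared links. The sketch
below assumes that the FEC gain is the ratio of the expected number of transmissions without
FEC to that with FEC, and that with no shared path the expectation is the sum over i of
1 - (1 - q_B^i)^r with q_B = 1 - (1 - q)^h. The function names and the truncation parameters
are illustrative.

```python
def expected_transmissions_independent(r, hops, q, tol=1e-15, max_i=1000):
    """Expected transmissions to reach r receivers over r disjoint paths of `hops` links."""
    q_b = 1.0 - (1.0 - q) ** hops  # loss probability of one whole path
    expectation = 0.0
    for i in range(max_i):
        # Probability that at least one receiver still misses the packet after i copies.
        tail = 1.0 - (1.0 - q_b ** i) ** r
        expectation += tail
        if tail < tol:
            break
    return expectation


def fec_gain(r, hops, q_plain, q_fec):
    """Ratio of expected transmissions without FEC to expected transmissions with FEC."""
    return (expected_transmissions_independent(r, hops, q_plain)
            / expected_transmissions_independent(r, hops, q_fec))
```

Under these assumptions, `fec_gain(30, 30, 1e-3, 1e-5)` evaluates to about 1.6, in line with
the value read from figure 4.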

For a big multicast group (figure 5) the FEC gain is nearly constant and reduces the
average number of transmissions by 50%, independent of the group size and the number
of shared links. The reason for the constant FEC gain is given by the number of
transmissions without FEC (figure 6) and with FEC (figure 7).

Figure 6 and figure 7 also show that the typically used assumption of independent loss in
analytical evaluations of reliable multicast communication is a worst case assumption, since
the number of transmissions decreases with the number of shared links.




Figure 6: Expected number of transmissions for a big multicast group in tree 3 without FEC.

Figure 7: Expected number of transmissions for a big multicast group in tree 3 with FEC
(h = 30, q1 = q2 = 0.00001).



For a big group the number of links in the multicast tree is very high, so the expected
number of transmissions $T_{FEC}$ in the case with FEC (figure 7) is not close to one despite
the low link loss probability. In contrast to the case of a small multicast group, it therefore
influences the FEC gain (figure 5).

4.2 Small Multicast Groups: Partial FEC 

A class of applications including multimedia conferences or distance teaching can profit from 
multicast communication. We model the multicast tree for this class by a tree where several 
participants are located in a private network and a path traversing a WAN connects the 
source with the receivers at the remote site. 

We consider that the node where the MC tree makes the transition from the shared links 
to the dedicated links is the gateway between the world of the WAN (shared links) and the 
world of the LAN. 

We will now use tree 3 to model a multicast tree that splits up at the entry into a private
network (LAN). In the private network, we will have a limited number of receivers r = 1 ... 30
and the number of links from the branch node to every receiver will also be relatively
small: k = 1 ... 5. We also assume that the link loss probability in the WAN is higher
(qi = 10“^), than in the private network {q 2 = 10“^). 

In the previous section we looked at the impact of FEC when applied from the source to
the receivers of the multicast tree. However, FEC can also be applied selectively, covering
a part of the multicast tree. We consider the two cases where FEC is applied just on the
shared links, or just on the independent links close to the receivers. Again we assume that
FEC lowers the loss probability by two orders of magnitude. We compare these two cases
in terms of the gain in the number of transmissions. Figure 8 shows that protecting the
shared path in the WAN is more efficient in this case. The gain achieved by FEC is very low
in both cases (under 1.03). For FEC in the WAN this is due to the low number
h - k = 25 ... 29 of links that are affected by FEC. For FEC in the LAN more links are
affected, but the already low link loss probability does not lead to a big FEC gain, since the
number of transmissions without FEC is already close to 1.



Figure 8: The gain for employing FEC just in the WAN (upper plane), or just in the LAN
(lower plane).



We will model a small multicast group connected by wireless links in the private network, 
yielding a higher packet loss probability of Q 2 = 10~^ than in the WAN (gi = 10~^). The 
other parameters of the model stay the same as in the LAN-WAN case. Figure 9 shows that
protecting the wireless links in the LAN is the better choice and that the application of
FEC on the WAN part results in no FEC gain. This is mainly because the h - k links in the
WAN already have a small link loss probability, so that a further lowering of the packet
loss probability in the WAN does not decrease the number of transmissions, since the high
link loss probability in the wireless LAN stays constant. The FEC gain of applying FEC on
the wireless links is shown in figure 9 by the upper plane, where the gain increases with the
number of receivers up to 1.8. As already stated in the results for FEC on the whole
multicast tree, the FEC gain increases with the number of independent LAN links k.

Figure 9: The gain for employing FEC just over the shared WAN links (lower plane), or
just on the tree inside the wireless LAN (upper plane).



The results in this section suggest that applying FEC on the links with the highest packet
loss probability is always the best choice. In the next section we will see that this does not
hold for big multicast groups and we will give the reasons.



4.3 Big Multicast Groups: Partial FEC 

The case described in the previous sections was for a small group of receivers. We will now 
use tree 4 (see figure 3) to model big multicast groups where multiple receivers are located 
in several LANs. 

We have $r_{LAN}$ = 5, 25, 100 receivers in every LAN; the total number of receivers
r = 100 ... 3000 varies. The link loss probability is $q_1 = q_2 = 10^{-3}$ in the WAN and
$q_3 = 10^{-4}$ in the LAN. The length of the common WAN path varies over
h - k - l = 0 ... 24. The results are shown in figures 10, 11 and 12. We assume again that
FEC lowers the link loss probability by two orders of magnitude. We compare the cases
where FEC is employed either on the WAN part or on the LAN part of the multicast tree.

Figure 10: The gain of FEC in WAN (upper plane) and FEC in LAN (lower plane) for
5 receivers in a LAN (h = 30, l = 5, rLAN = 5, q1 = q2 = 0.001, q3 = 0.0001; plotted over the
length of the common WAN path).



Figure 11: The gain of FEC in WAN (plane up to 1.2) and FEC in LAN (plane up to 1.6) for
25 receivers in a LAN (h = 30, l = 5, rLAN = 25, q1 = q2 = 0.001, q3 = 0.0001; plotted over
the length of the common WAN path).

Figure 12: The gain of FEC in WAN (lower plane) and FEC in LAN (higher plane) for
100 receivers in a LAN (h = 30, l = 5, rLAN = 100, q1 = q2 = 0.001, q3 = 0.0001).



The results for small multicast groups indicated that the FEC gain is highest when FEC is
applied over the links that have the highest link loss probability. This does not hold for big
groups. The highest link loss probability is found on the WAN links, but for
$r_{LAN}$ = 100 receivers per LAN the FEC gain is higher for FEC applied on the links in the
LANs than on the links in the WAN (figure 12). The FEC gain achieved by applying FEC on
the WAN decreases as the number $r_{LAN}$ of receivers in a single LAN increases (figure 10
to figure 12), until it is lower than the gain for FEC on the LAN (figure 12). The reason is
the strong decrease in the number of WAN links with increasing $r_{LAN}$, while the total
number of links in all LANs is always constant (5r), whatever the number of receivers
$r_{LAN}$ in one LAN is.
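The link counts behind this argument follow directly from the structure of tree 4: one
common WAN path of h - l - k links, one WAN branch of k links per LAN, and one LAN branch
of l links per receiver. A small sketch (the function name is illustrative, and r is assumed
to be a multiple of `r_lan`):

```python
def tree4_link_counts(h, k, l, r, r_lan):
    """Return (number of WAN links, number of LAN links) in tree 4."""
    if r % r_lan != 0:
        raise ValueError("r must be a multiple of r_lan")
    lans = r // r_lan                    # number of LANs
    wan_links = (h - l - k) + k * lans   # common path plus one branch per LAN
    lan_links = l * r                    # one unshared branch per receiver
    return wan_links, lan_links
```

For r = 3000, h = 30, l = 5 and k = 25, going from 5 to 100 receivers per LAN reduces the
number of WAN links from 15000 to 750, while the number of LAN links stays at 15000.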



5 CONCLUSION 

We defined two generic multicast trees that allowed us to capture the essential characteristics
of a multicast tree such as degree of link sharing or group size. In both cases we derived the
analytical formula for the average number of transmissions necessary to reliably deliver a
packet to all receivers of the group. The formulas allow different link loss probabilities to be
modeled in different parts of the multicast tree. Our main results are as follows:

When FEC is applied to all links of the multicast tree we observed that 

• for small groups the FEC gain increases as the number of receivers increases (figure 4)
and the number of shared links decreases.

• for big groups the FEC gain is larger than for small groups and is independent of the
number of receivers.

When FEC is applied either on the links of the LAN part or on the links of the WAN part
of the multicast tree, we observed that

• for small groups the highest FEC gain is achieved when FEC is applied on the part where
the links have a high loss probability.

• for big groups this is only true for a low number $r_{LAN}$ of receivers in one LAN. For a
large number of receivers in one LAN, FEC should be applied on the LAN links.

We also showed that the assumption of independent loss among receivers is a worst case
assumption. In [BMT94] the cumulative loss probability experienced at the source was
evaluated for different generic multicast trees for the same number of links in those trees.
The authors conclude that minimizing the degree of overlap between paths in a multicast
tree results in more efficient reliable multicast. This is true for multicast groups that are
connected to the source with the same number of links either by independent paths or by
overlapping paths. But, in general, overlapping paths result in a lower number of links in the
multicast tree, compared to a multicast tree consisting of independent paths. Our multicast
tree model (tree 3) indicates that the number of links has a higher impact on the number of
transmissions than loss on shared links has. The number of transmissions decreases with a
high overlap of paths.

Abstract

This paper analyzes the performance of a reliable multicast transport protocol and discusses 
experimental test results. The Reliable Multicast Transport Protocol has been proposed to 
support "reliable" information delivery from a server to thousands of receivers over unreliable 
networks via IP-multicast. The protocol provides high-performance for most receivers through 
the advantage of IP multicast while also supporting temporarily unavailable or performance 
impaired receivers. Its applicability to large scale delivery is examined using an experimental 
network and the backoff time algorithm which avoids ACK implosion. The two types of flow 
control with the protocol are also examined. Separate retransmission is used to offset the local 
performance decline limited to a small number of receivers. Monitor-based rate control is used 
to offset the global performance declines due to causes such as network congestion. 

Keywords 

Reliable multicast protocol, large-scale delivery, rate-based flow control, performance analysis 



1 INTRODUCTION 

The information age requires large-scale reliable information distribution functions. Scalable 
information delivery has become feasible with the advent of high-speed networks which offer 
large bandwidth. However, the number of users still affects the reliability and performance of 
information distribution systems [NE94]. 

For scalable information distribution, reliable multicast has been studied as an efficient and
reliable means of distribution. A generic solution to all categories is not feasible because of
the inhomogeneity of service requirements, and different solutions have been proposed for
each service category. Representative reliable multicast services can be classified as follows,
based on the service type and the reliability needed.

(a) Massive information delivery: Digital information prepared in a server is distributed to
many subscribers without any data flaw, under some service time constraint. Electronic
newspaper delivery and data replication for naming or archive services are typical applications
[OB 94]. The main research issue is scalable and efficient multicast which achieves
complete data replication by recovery procedures [KC96], [PTK94]. This type of service is
very promising for actual publishing services; it suits the broadband networks that are now
emerging [IT91].

(b) Group communications: Members of a group share the same information which is 
distributed to all members. Shared whiteboard is a typical application in this category. The 
shared information is updated by short multicast messages by each member. The main 
research issue is efficient concurrent multicast which guarantees the message order [BS91], 
[CS93], [EM95]. Scalability may also be required, with the ordering constraint relaxed [FJM95].

(c) Distributed simulation: Many simulation sites execute simulations and share the results by 
multicast. Each site must distribute its state information to other sites within a limited time 
frame to effect the simulation. The main research issue is scalable multicast while keeping 
the severe time constraint [HSC95], [SK96]. 

The Reliable Multicast Transport Protocol (hereafter RMTP) [STS95], [STY96], [TS96]
has been proposed for type (a) applications by using IP-multicast [DE90]. RMTP supports
"reliable" information delivery from a server to thousands of subscribed receivers. "Reliable"
information delivery means complete data replication in the receivers and receiver confirmation
before and after data delivery.

This paper analyzes the performance of RMTP and discusses experimental test results. 
While the large-scale implementation of reliable information delivery is expected, few reports 
have addressed this subject for type (a) applications. Only a general case model analysis can be 
applied for large-scale delivery including thousands of receivers [PTK94]. 

The proposed protocol RMTP is realized in the transport layer with the unreliable transport
protocol UDP over the connectionless best-effort network protocol IP. Responses are reported
to the server as end-to-end transport communications instead of using gateways (intermediate
nodes) which gather some responses and respond to the server (root node) or upper gateways.
The reason why gateways are not used is that they entail tree structure maintenance, including
the structure of lower gateways and leaf nodes. This end-to-end retransmission concept has
been argued for in [SRC84] and is supported by most reliable multicast protocols. Furthermore,
imposing gateway functions on the intermediate hosts or routers creates unwarranted costs
when constructing and maintaining a large-scale information delivery system.

RMTP is a receiver-initiated protocol in which receivers are responsible for detecting data
packet loss and sending NACK to the server. An analysis based on a generalized model
reports that receiver-initiated reliable multicast protocols provide substantially higher
throughput (packet processing rate) than sender-initiated reliable multicast protocols [PTK94],
and most reliable multicast protocols follow the receiver-initiated approach [WMK], [KC96].
RMTP follows this idea and fundamentally supports receiver-initiated reliable multicast.

In addition to NACK, RMTP also uses ACK, with which each receiver confirms complete
receipt of the whole data set.

The ACK implosion problem, i.e. response overflow caused by the concentration of responses
at the server, which is inevitable with end-to-end confirmation, can be solved by the backoff
time algorithm [CZ85], [DA94]. In this paper, the applicability of the backoff time algorithm
is examined for large scale delivery in the framework of RMTP.

Another intractable issue in reliable multicast for large scale information delivery is that
receiving conditions degrade due to network congestion or local receiver problems. These
factors degrading receiving performance will eventually occur in individual receivers or
globally in all receivers. We try to resolve this issue by providing high-performance multicast
to most receivers while still supporting temporarily unavailable or performance-impaired
receivers. To this end, two flow control procedures, separate retransmission and monitor-based
rate control, are adopted in RMTP.

Separate retransmission is provided to offset any local performance decline experienced by
a receiver due to causes such as those occurring in a wireless environment. When the
performance decline of a receiver is detected, the server freezes the retransmission control for
that receiver until transmission and retransmission to the other, normal receivers are
completed. Thus, the proposed procedure prioritizes the normal state receivers while still
supporting temporarily unavailable receivers.

The monitor-based rate control is used to offset the global performance decline experienced
by all receivers; causes include network congestion. Rate control adapts to the receivers as a
whole: the overall data receiving conditions are reflected in the current data transmission rate.

After examining related works in Section 2, the design concept of RMTP is described in 
Section 3. The error control part of the protocol is analyzed and compared with experimental 
results as to throughput (transfer time) and server processing load (packet processing amount) 
in Section 4. The adaptive end-to-end rate control is also evaluated through performance tests. 



2 RELATED WORKS ON RELIABLE MULTICAST 
Reliable multicast protocols 

Reliable multicast protocols applicable to massive information delivery are described below and
compared against our RMTP. The trends in reliable multicast protocols for massive information
delivery and group communication are reported exhaustively in [OB 94] and [DD96].

Multicast Transport Protocol (MTP) [AFM92] is a general purpose multicast protocol over 
UDP and IP-multicast. It bases retransmission on receiver-initiated NACK and fixed-size 
window control which causes heavy retransmission overhead. MTP also adopts dynamic 
membership control via a group master which is also responsible for retransmission. Due to 
these heavy control overheads, MTP has not been implemented. MTP's retransmission idea 
can be found in Reliable Broadcast Protocols (RBP) [CM84]. RBP was proposed for 
broadcast LANs and is not scalable beyond LANs. 

Reliable Multicast Protocol (RMP) [WMK] is a successor of RBP. RMP reduces the 
group master overhead by rotating the role of group master among all members as in RBP. 
However, its scalability is still questionable and reported tests cover less than 10 receivers. 

RMP seems to be somewhat dedicated to CSCW applications judging from its support of 
ordering guarantee and multi-RPC. 

Adaptive File Distribution protocol (AFDP) [KC96] is a promising, recently proposed 
multicast protocol for massive information delivery. AFDP is also a receiver-initiated protocol 
as is RMTP. AFDP uses only NACK for data receive confirmation, and also supports rate 
control based on NACK reporting. These data recovery and flow control ideas can be found in 
NETBLT [CLZ87] which was developed for one-to-one bulk data delivery. AFDP sends 
NACK several times for each round of data transfer, while RMTP uses it only once and sends 
only one NACK/ACK for each round. Furthermore, no amendment for NACK loss is 
supported in AFDP as is true in the previous three multicast protocols. Testing of AFDP has 
been reported for 16 receivers with 7 Mbyte data and 68 receivers with 133 Kbyte. 

XTP, which was developed primarily for high speed data transfer, also adopts reliable
multicast. However, the XTP revision 4.0 specification [XT95] leaves details of flow and error
controls to the user implementation while sufficiently supporting multicast group management.

ACK implosion

Any system that makes the information serving process dependent on responses from the 
information receiving process suffers from the ACK implosion problem [DA94]. The 
response concentration to the server results in response loss and causes redundant 
retransmission. In our protocols, the server requires explicit responses of complete receipt 
notification (ACK) or incomplete receipt notification (NACK). 

A promising solution against ACK implosion is the backoff time algorithm, which was
proposed by [CZ85]. Its behavior with limited socket buffers was analyzed by a queuing
model and tested for ten real receivers and rather large response packets [DA94]. We have
examined and tested the proposed protocol using the backoff time algorithm in a real Ethernet
LAN environment for up to ten thousand receivers by receiver emulation including packet loss
and network delay. The results confirm that the backoff time algorithm is effective even in
large-scale environments with thousands of receivers (see Section 4).

Other research [PTK94] analyzed reliable multicast protocols in the form of a generic model
that uses NACK for each data packet loss. The revised protocol also uses multicast for NACK
and reduces the number of NACKs sent to the server in such a way that receivers do not issue
NACKs if other receivers have already sent a NACK (by multicast) for the data packet. The
analysis showed that, while some overhead is imposed on receivers, server performance is
improved by this amendment. However, the protocols have the restrictive assumption that
NACKs are never lost in the network. Further, multicasting by receivers in large scale, high
delay networks will cause a flood of responses all over the network.

Recent research [H96] introduced a different approach to the implosion problem, the
Local Group Concept (LGC), in which the receivers in a LAN form a local group and
execute retransmission in a coordinated way. Data retransmission is achieved first in the LAN
by a local group controller; the data missed by the whole local group is then reported to the
server, and the server retransmits the data to the local groups. Simulations in a restricted WAN
and LANs showed that the LGC approach achieved better end-to-end transfer delay and wide
area link traffic than retransmission supervised entirely by a server. If LGC is implemented
with RMTP, it will increase the supported number of receivers by a factor of the local group
size.

Other recent research [LP96] has also proposed a reliable multicast transport protocol for
massive-data delivery which is based on periodic ACKs and window flow control. The
protocol adopts hierarchical gateways to offset ACK implosion. This approach is in the
opposite direction to the RMTP proposed in this paper. Experiments on the distribution of a
1 Mbyte file to eighteen receivers in a real LAN and WAN environment have been reported.



3 RELIABLE MULTICAST TRANSPORT PROTOCOLS 
3.1 Basic multicast retransmission procedure 

The proposed multi-round multicast retransmission procedure, which meets the requirement of 
complete reliability, is shown in the multicast transmission /retransmission phase in Figure 1. 

The whole data set, such as a file, is split into multiple transport packets with sequence
numbers. After the first round of multicasting all packets, packets not received due to error or
loss are reported to the server from the receivers by unicasting NACK. The server determines
the retransmission packets needed from the unreceived packet reports in the NACKs; packet
duplication is prevented by referring to their sequence numbers. The server then multicasts the
retransmission packets in the second round of data transfer. Thus, retransmission is based on
the selective-repeat procedure, which requires fewer retransmission packets than go-back-N
procedures. The server continues retransmission until no packet loss is reported by receivers.

In the receivers, duplicated packets are discarded.
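The multi-round procedure can be illustrated with a small simulation. This is a sketch under
simplifying assumptions that are not taken from the protocol specification: every packet is
lost independently with probability `per` at each receiver, NACKs are never lost, and the
function name and the seeded random generator are chosen for illustration only.

```python
import random


def multicast_rounds(num_packets, num_receivers, per, seed=0):
    """Simulate multi-round selective-repeat multicast; return (rounds, packets_sent)."""
    rng = random.Random(seed)
    # Retransmission management table: sequence numbers each receiver still lacks.
    missing = [set(range(num_packets)) for _ in range(num_receivers)]
    to_send = set(range(num_packets))
    rounds = 0
    packets_sent = 0
    while to_send:
        rounds += 1
        packets_sent += len(to_send)
        for lacks in missing:
            for seq in sorted(to_send):
                # A packet the receiver already holds is a duplicate and is ignored.
                if seq in lacks and rng.random() >= per:
                    lacks.discard(seq)
        # Each receiver reports its missing sequence numbers in a NACK; the server
        # takes the union, so a packet requested by several receivers is sent once.
        to_send = set().union(*missing)
    return rounds, packets_sent
```

Here the first multicast of the whole data set counts as round 1. With `per = 0` a single
round suffices and exactly `num_packets` packets are sent; with loss, later rounds carry
only the sequence numbers that at least one receiver still lacks.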

This multi-round procedure assumes retransmission management tables to record the 
numbers of successfully transmitted packets both in the server and receivers. The server 
manages all packet conditions for each receiver. The size of the whole data set, the total number 
of data packets, and data packet size are indicated in the connection establishment phase prior to 
the data transmission phase from the server to the receivers and the information is used for 
packet retransmission management at both sides. 

The number of retransmissions as a function of the error rate was assessed. Figure 2
shows the number of retransmissions needed to realize multicast retransmission versus the bit
error rate (BER) or packet error rate (PER) based on the analysis in Appendix A. The error
rate is assumed to include all transmission errors and packet loss throughout the network from
a sending end to a receiving end for each packet. The delivery amount is 2 Mbytes and the
packet error rate is shown for the 1 Kbyte packet size. Several curves are drawn for differing
numbers of receivers up to 500,000.

Note on data size: A 32 page daily newspaper amounts to 2 Mbytes (text and compressed 
images). The Internet version of the NYTimesFax (A4 size 8 pages, pdf format) occupies 
about 100 Kbytes, which suggests that a 160 page book would compress to 2 Mbytes. 

Figure 1 Communication sequence overview.
(Exchange between the server and the receivers: connection establishment phase; multicast
transmission/retransmission phase with the 0th, 1st and 2nd rounds; separate retransmission
phase.)

As an example, for the 5000 receiver case, the graph shows the number of multicast
retransmissions needed for 10,000,000 packets (2000 packets x 5000 receivers) to be correctly
received. For the PER of $10^{-2}$ (BER $1.22 \times 10^{-6}$), three or four retransmissions are
needed to complete data transfer. The BER $10^{-6}$ corresponds to rather low quality networks,
since the standard ATM cell loss and error rates for international connections being discussed
for Quality of Service in ITU-T SG13 are $3 \times 10^{-7}$ and $4 \times 10^{-6}$, which correspond
to BERs of less than $1 \times 10^{-8}$ (a cell is 53 bytes), and the BER for packet switched
networks is said to be better than
10"^. We note that in the best-effort type networks such as the Internet, PER may decline to 
10%. Even for PER 10% , the graph shows that seven retransmissions achieve complete 
delivery to 5000 receivers. These estimations have been confirmed in the experimental tests in 
Section 4.2. 
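The order of magnitude of these figures can be reproduced with a back-of-the-envelope
estimate. This is not the analysis of Appendix A; it is a simplified sketch that assumes
independent bit errors and independent packet loss, and that counts rounds until the
expected number of still-missing packet copies falls below a threshold. The function names
and the threshold of 0.5 are illustrative assumptions.

```python
def packet_error_rate(ber, packet_bits):
    """PER of a packet of `packet_bits` bits under independent bit errors."""
    return 1.0 - (1.0 - ber) ** packet_bits


def estimated_retransmissions(packets, receivers, per, threshold=0.5, max_rounds=1000):
    """Retransmission rounds until the expected number of missing copies < threshold."""
    pending = float(packets * receivers)  # packet copies still to be delivered
    for retransmissions in range(max_rounds):
        pending *= per                    # copies expected to be lost in this round
        if pending < threshold:
            return retransmissions        # round 0 is the initial transmission
    raise RuntimeError("no convergence; per is too close to 1")
```

For 2000 packets and 5000 receivers this estimate gives three retransmissions at a PER of
0.01 and seven at a PER of 0.1. A 1 Kbyte packet (8192 bits) at a BER of 1.22e-6 has a PER
of roughly 0.01.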

Figure 2 also shows that the number of receivers does not strongly affect the number of
retransmissions. According to the same analysis in Appendix A, the data size also has little
impact on the number of retransmissions.

3.2 Reliable Multicast Transport Protocol in detail

This section describes detailed procedures in RMTP and examines flow controls. The major
communication phases are the connection establishment phase, the multicast
transmission/retransmission phase, and the separate retransmission phase, as shown in Figure 1.

3.2.1 Connection management 

An explicit connection establishment procedure is used in order to pass to the receivers the 
information needed for retransmission such as total packet number. Furthermore, connection 
establishment and release procedures are necessary to confirm the registered users prior to data 
transfer and to charge for successful data delivery. 

Membership is decided before the connection establishment phase by some other protocol 
or manually and is assumed to be constant during the data transfer phase until communication 
is complete and all connections are released. This assumption is realistic for bulk-data 
multicasting, while other interactive multicast communications such as whiteboard may require 
dynamic membership control during the data transfer phase. 

Connection establishment and release procedures in RMTP are as follows. 

The server sends a connection establishment request packet (CONN) to all receivers via multicast,
based on the membership list given by the application. Each receiver sends an
acknowledgment packet (CACK), which includes Yes (ready to communicate) or No (unable to
communicate), to the server via unicast. The server also uses a timer to wait for CACK
responses and retransmits CONN by unicast. Authentication parameters may also be
included in the CACK for secure communication establishment.

The server multicasts a connection release packet (REL) after each round of data
re/transmission, and the receiver's response is ReleaseACK (RACK). Receivers that have sent
NACK or BUSY simply ignore REL packets.

3.2.2 Response handling 

Response handling procedures in the recovery operations stated in Section 3.1 are as follows.
The sequence numbers of all data packets not received in one round of data transmission are
included in a NACK and reported to the server by unicast. When a receiver has received all
packets, it sends an ACK to the server by unicast. RMTP uses an explicit ACK, instead of
implicitly confirming successful receipt by a timeout (no NACK) in the server, because the
timeout is used to detect response loss. In case of a response timeout, the server sends a
POLL packet, which solicits an ACK/NACK (see Figure 3). The use of POLL covers the cases of
ACK/NACK loss and short-term receiver unavailability and, consequently, stops these problems
from causing unnecessary multicast retransmission traffic.

The unavoidable issue in response handling is preventing response implosion. There are two
causes of response implosion. One is the frequency of ACK/NACK issued by receivers, and
this can be avoided by restricting ACK/NACK to just once per re/transmission round. The
other cause is the number of receivers, and this is offset by using the backoff time algorithm as
follows [DA94]. Each receiver holds its response for a random, uniformly distributed delay
period, called the backoff time, before sending it to the server. The range of the distribution is
conveyed to the receivers beforehand as one of the connection establishment parameters. The
algorithm is also used for responses in connection management (CACK/RACK). The
applicability of the algorithm to large scale multicasting is examined in a later section.
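
The backoff idea can be illustrated with a small simulation. This is a minimal sketch, not the RMTP implementation: the function names are hypothetical, the delay range of (mean response interval x number of receivers) is borrowed from the model in Section 4.1, and the fixed random seed and the 10 ms counting bins are assumptions made only for illustration.

```python
import random

def backoff_delays(n_receivers, delta, seed=0):
    """Draw one uniformly distributed backoff delay (seconds) per receiver.

    The range [0, delta * n_receivers] follows the model in Section 4.1;
    delta is the mean response interval of the receiver set.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    upper = delta * n_receivers
    return [rng.uniform(0.0, upper) for _ in range(n_receivers)]

def peak_arrivals(delays, window):
    """Largest number of responses that fall into one time bin of width `window`."""
    counts = {}
    for d in delays:
        slot = int(d // window)
        counts[slot] = counts.get(slot, 0) + 1
    return max(counts.values())

if __name__ == "__main__":
    n, delta = 5000, 0.002  # 5000 receivers, 2 ms mean response interval
    delays = backoff_delays(n, delta)
    print("responses spread over", round(max(delays), 2), "s")
    print("peak responses per 10 ms bin:", peak_arrivals(delays, 0.010))
```

Without backoff, all 5000 responses would arrive within roughly one turn around time; with it, the server sees only a handful per bin, at the cost of a longer wait before the round can be closed.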

The current response packet sizes of ACK, CACK, and RACK are 5, 6, and 4 bytes,
respectively. NACK size is variable. The NACK header is 8 bytes, and the size of the NACK
information that follows is (the number of lost packets and/or range symbols) x (the size of a
number or symbol: 2 bytes). The range symbol is used to express a continuous burst of lost
packets in a shorter NACK packet. A packet loss of 1% yields NACKs of 48 bytes for 2000 data
packets under random errors.
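
The NACK size arithmetic can be checked with a short sketch. The 8-byte header and the 2-byte entry size come from the text; the way a burst is encoded (first number, range symbol, last number, for runs of three or more) is a hypothetical encoding assumed here, since the text does not specify it.

```python
NACK_HEADER_BYTES = 8  # from the text
ENTRY_BYTES = 2        # one sequence number or one range symbol

def nack_entries(lost):
    """Count NACK entries for a sorted list of lost sequence numbers.

    Assumed (hypothetical) encoding: a run of three or more consecutive
    numbers is written as first, range symbol, last (3 entries);
    shorter runs are listed number by number.
    """
    entries = 0
    i = 0
    while i < len(lost):
        j = i
        while j + 1 < len(lost) and lost[j + 1] == lost[j] + 1:
            j += 1
        run = j - i + 1
        entries += 3 if run >= 3 else run
        i = j + 1
    return entries

def nack_size_bytes(lost):
    """Total NACK packet size in bytes for the given lost sequence numbers."""
    return NACK_HEADER_BYTES + ENTRY_BYTES * nack_entries(lost)
```

Twenty isolated losses out of 2000 packets (1% random loss) give 8 + 20 x 2 = 48 bytes, matching the figure quoted above, while a single burst of twenty consecutive packets would need only 14 bytes under the assumed range encoding.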

In the multicast re/transmission phase, only multicast was used in the analyses and tests, 
although unicast retransmission can be used for a small number of receivers depending on the 
priority of reducing traffic or optimizing throughput. 

3.2.3 End-to-end flow controls 

Two flow control procedures, separate retransmission and monitor-based rate control, are
applied in RMTP depending on the state of performance degradation.

Separate retransmission 

Separate retransmission is used for a local performance decline limited to a few receivers and
avoids continuing unnecessary, redundant multicast retransmission. When the performance
decline of a receiver is detected, the server freezes retransmission control for that receiver
until re/transmission to the other, normal receivers is completed. If one or more receivers cannot
complete the multicast data transfer procedures for some reason, such as poor receiver performance
or human interruption, separate unicast retransmission is conducted after the rounds of
multicast retransmission finish. Such a receiver issues a BUSY packet to declare its condition
to the server in the multicast transmission/retransmission phase. After recovering from its
exceptional state, the receiver issues a ReceiveReady packet (RRDY) to show that it is able to
receive retransmission packets. The states of received packets are kept in both the server and
the receiver until separate retransmission starts.

The server itself may detect low performance receivers from the status of NACKs or a
timeout after POLL and handle them in the same way as receivers that declared BUSY/RRDY.
This is effective for communication suspensions caused by receivers moving into areas of low
transmission quality in a mobile wireless environment.

Monitor-based rate control 

The monitor-based rate control is used for a global performance decline in which all receivers
suffer performance declines due to network congestion. Rate control adapts to the condition of
the receivers as a whole and reflects the data receiving condition in the current data transmission
by making monitoring procedures independent of recovery procedures.

Window flow control has been used in one-to-one transport protocols such as TCP based 
on the continuous acknowledgment from a receiver. Window-based flow control does not suit 
NACK-based multicast recovery procedures such as AFDP [KC96] and RMTP where ACK is 
not used to report each data packet reception. Rate control in AFDP is based on the NACKs 
occasionally emitted by receivers. Since RMTP uses a NACK only once for each round of 
data transfer or retransmission, receiver state monitoring by NACK transfer is not effective. 
Accordingly, RMTP adopts monitor-based rate control which is independent from error 
recovery and sensitive to receiver performance. 

The proposed rate control scheme is executed by monitoring the number of packets received at
the receivers as follows.

1. The server indicates the start of packet number measurement to receivers by setting the start
bit of a data packet on. Each receiver starts counting the number of received packets from the
start packet.

2. The server indicates the stop of packet number measurement to receivers by setting the stop
bit of a data packet on. On receiving that packet, each receiver stops counting
received packets and sends back a report packet to the server that includes the number
of packets received during the monitor period.

3. The server collects the reports, compares them to the number of transmitted packets, and
changes the transmission rate accordingly. The rate is defined as the number of packets to
be continuously transmitted in an interval. The decision is based on the trend across most
of the receivers, using a criterion such as the mean number of received packets.

By repeating the above scheme, the server can adaptively set the transmission rate to match the
overall receiving performance under global conditions such as network congestion. The effect
of the scheme depends on the criterion that decides the change of rate in step 3.
Elaborated criteria are used in the experiment shown in Section 4.3.
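
Step 3 of the scheme can be sketched as a single decision function. This is only an illustration of the mean-value criterion mentioned in the text: the function name, the two thresholds, the step factors and the rate limits are all hypothetical values chosen for the example, not those used in the experiments.

```python
def next_rate(rate, sent, reports, low=0.95, high=0.99,
              step_down=0.8, step_up=1.1, min_rate=1, max_rate=1000):
    """Return the new transmission rate (packets per interval).

    sent    -- packets transmitted during the monitor period
    reports -- per-receiver counts of packets received in that period
    All thresholds and step factors are illustrative assumptions.
    """
    if not reports or sent <= 0:
        return rate  # nothing to judge; keep the current rate
    # Mean fraction of transmitted packets that the receivers reported.
    mean_ratio = sum(reports) / (len(reports) * sent)
    if mean_ratio < low:
        rate = rate * step_down   # receivers are losing packets: slow down
    elif mean_ratio >= high:
        rate = rate * step_up     # almost everything arrived: speed up
    return max(min_rate, min(max_rate, rate))
```

For example, if 200 packets were sent and three receivers report 150, 160 and 170 packets, the mean ratio is 0.8 and a rate of 100 is reduced to 80 under these assumed parameters.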

When a very large number of receivers exist, the implosion that may be caused by reports can be
avoided by selecting, from among all receivers, representative receivers that send back the
reports. The rough condition of the receivers as a whole can be estimated by monitoring
the representatives.



4. PERFORMANCE EVALUATIONS 

4.1 Model evaluations 

Performance evaluations on connection establishment and multicast re/transmission phases are 
shown based on model analysis prior to the experimental evaluation in the next section. 

The performance of the proposed data distribution procedure was modeled and evaluated 
by the criteria of transfer time and packet processing amount, both of which are important in 
service and system design. The analysis mainly addresses the basic multicast retransmission 
procedure including the use of POLL, while the effect of BUSY is not included as it is receiver 
environment dependent. The derivation directly shows the mean values of transfer time and 
packet processing amount. 

(1) Distribution time evaluation 

Term definitions 

$N_k$: the number of receivers in the $k$th round transfer
$N_0$: the total number of receivers to be transmitted to
$M_k$: message length of the $k$th retransmission (number of packets forming the message)
$M_0$: the whole data set size given by the application; $M_1$: the first retransmission data size
$N_k$ and $M_k$ are obtained by the calculations given in Appendix A.

$T_k$: total time of the $k$th round transfer
$T_0$: the initial transfer; $T_1$: the first retransmission
$N'_k$: the number of responses (ACK/NACK) that overflow in the server, or the number of first
POLL packets for the $k$th round
$N''_k$: the number of POLL packets lost during the first POLL transmission, which need
retransmission

$\delta$: mean response producing interval at the "receiver set". NACK is assumed to be sent by the
receiver in the same way as ACK. $1/\delta$ corresponds to the response rate of the receiver set
according to a uniform random distribution over $[0, \delta N_k]$, $[0, \delta N'_k]$, and $[0, \delta N''_k]$
$v$: transfer rate at the server (packets/second); a constant rate is assumed in the model
$e$: packet loss rate of a packet transiting the whole network one way

$\mathrm{Pol}(x)$: probability that at least one POLL is used as received from $x$ receivers in a round
$V(x)$: mean number of packets overflowing the server buffer as received from $x$ receivers in a
round

Since the buffer overflow rate in a round is expressed as $\mathrm{Buf}(x) = V(x)/x$, $\mathrm{Pol}(x)$ is obtained as

$$\mathrm{Pol}(x) = 1 - (1-e)^x + (1-e)^x\,\mathrm{Buf}(x). \qquad (1)$$

The first two terms are the network loss rate, and the third one is the overflow rate at the server. 
V(x) is calculated through direct queuing system calculation as follows [STY96]. 

X n-b-1 

V(x) = I Pn,b, where Pn,b= (1 - E m+b-2Cb-2 1/p*", and p= 1/ 6p. 

m=o at server. 

The timer value for the server to wait for responses from receivers is $\mathrm{TAT} + \delta N_k + \alpha$, where
TAT is the turn around time and $\alpha$ is a surplus time added when setting the timer.

Optimal performance is given by $\alpha = 0$, but for implementation some positive value is set.

In the following derivation, trailing terms which express retransmission of POLL more 
than once are neglected. That is, the rate of POLL loss more than once is very small and 
POLL transmission is assumed to be at most twice for each silent receiver. 

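The data transfer time for the $k$th round can be calculated as follows, based on the communication
sequences in Figure 3.

$$
\begin{aligned}
T_k &= \frac{M_k}{v} + \mathrm{TAT} + \delta N_k + \alpha
      + \frac{N'_k}{v} + (\mathrm{TAT} + \alpha)\,\mathrm{Pol}(N_k)
      + \frac{N''_k}{v} + (\mathrm{TAT} + \alpha)\,\mathrm{Pol}(N'_k)
      + (\text{POLL loss} > \text{once}) \\
    &\approx \frac{M_k + e N_k + V(N_k) + V(N'_k)}{v}
      + \bigl(N_k + V(N_k) + V(N'_k)\bigr)\,\delta
      + \bigl(1 + \mathrm{Pol}(N_k) + \mathrm{Pol}(N'_k)\bigr)(\mathrm{TAT} + \alpha) \qquad (2)
\end{aligned}
$$

where $N'_k = e N_k + (1 - e)\,V(N_k)$, $N''_k = e N'_k + (1 - e)\,V(N'_k)$,
and the higher order small terms $e^2$, $e\delta$, and so on are neglected.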

In the last two equations for $N'_k$ and $N''_k$, the first and second terms express the network loss
rate of response packets and the buffer overflow rate at the server, respectively.

In the equation for $T_k$, the coefficients of $1/v$, $\delta$, TAT, and $\alpha$ express the effects of data
packet transmission time, backoff time, turn around time, and timer surplus time, respectively.
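
The approximate form of equation (2) is easy to evaluate numerically once the overflow function V(x) is known. The sketch below is an illustration under stated assumptions: the function and parameter names are hypothetical, and V(x) is passed in as a callable `v_of` rather than computed from the queuing formula of [STY96]. The example uses the simplest assumption, V(x) = 0 (no server buffer overflow).

```python
def pol(x, e, v_of):
    """Equation (1): probability that at least one POLL is needed."""
    if x <= 0:
        return 0.0
    buf = v_of(x) / x                 # Buf(x) = V(x) / x
    keep = (1.0 - e) ** x             # no response lost in the network
    return 1.0 - keep + keep * buf

def round_time(m_k, n_k, e, v, delta, tat, alpha, v_of):
    """Approximate form of equation (2): transfer time of round k in seconds.

    v_of is an assumed stand-in for V(x), the mean server buffer overflow.
    """
    n1 = e * n_k + (1.0 - e) * v_of(n_k)          # N'_k
    over = v_of(n_k) + v_of(n1)
    send = (m_k + e * n_k + over) / v             # coefficient of 1/v
    backoff = (n_k + over) * delta                # coefficient of delta
    waits = (1.0 + pol(n_k, e, v_of) + pol(n1, e, v_of)) * (tat + alpha)
    return send + backoff + waits

if __name__ == "__main__":
    no_overflow = lambda x: 0.0
    # 2000 packets, 100 receivers, no loss, 100 packets/s, 2 ms interval, 100 ms TAT
    print(round_time(2000, 100, 0.0, 100.0, 0.002, 0.1, 0.0, no_overflow))
```

With these inputs the send term is 20 s, the backoff term 0.2 s and the wait term 0.1 s, which shows how the data transmission time dominates for a large message and a moderate number of receivers.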




Figure 3 The kth round retransmission sequences.

Note that if the message size $M_k$ is small, the effect of backoff time dominates the data transfer
time, as the order of $\delta$ is as great as that of $1/v$. Hence, the bigger the message size $M$ is, the
smaller the relative overhead of the backoff time becomes. This means that the proposed
procedure suits the distribution of large amounts of data.

In the same way, the transfer time of connection establishment is obtained as

$$T_C = \frac{1 + e N + V(N) + V(N')}{v} + \bigl(N + V(N) + V(N')\bigr)\,\delta + \bigl(1 + \mathrm{Pol}(N) + \mathrm{Pol}(N')\bigr)(\mathrm{TAT} + \alpha), \qquad (3)$$

where $N' = e N + (1 - e)\,V(N)$ and $N'' = e N' + (1 - e)\,V(N')$.

Finally, the total data transfer time can be obtained as

$$T = T_C + \sum_{k \ge 0} T_k. \qquad (4)$$

(2) Packet processing load 

Packet processing load at a server can be calculated in the same way as the transfer time with
the same definitions and assumptions. Packet processing load refers to the number of UDP packets
needed to send and receive RMTP packets encapsulated in UDP user data. The packet
processing load corresponds to CPU time used in the operating system (CPU system time).

The numbers of packets sent, received, and sent-and-received for the $k$th retransmission are
expressed as follows, based on the communication sequences in Figure 3.

$$
\begin{aligned}
\mathrm{PS}_k &= M_k + e N_k + V(N_k) + V(N'_k), \\
\mathrm{PR}_k &= N_k + (1 - e)\bigl(V(N_k) + V(N'_k)\bigr), \\
\mathrm{P}_k  &= M_k + (1 + e) N_k + (2 - e)\bigl(V(N_k) + V(N'_k)\bigr), \qquad (5)
\end{aligned}
$$

where $N'_k = e N_k + (1 - e)\,V(N_k)$ and $N_k$ is as given in Appendix A.

The numbers of packets sent, received, and sent-and-received during the connection
establishment phase are

$$
\begin{aligned}
\mathrm{PCS} &= 1 + e N + V(N) + V(N'), \\
\mathrm{PCR} &= N + (1 - e)\bigl(V(N) + V(N')\bigr), \\
\mathrm{PC}  &= 1 + (1 + e) N + (2 - e)\bigl(V(N) + V(N')\bigr), \qquad (6)
\end{aligned}
$$

where $N' = e N + (1 - e)\,V(N)$.

The number of packets processed for connection release with no retransmission is given by
$\mathrm{PLS} = 1 + K$, $\mathrm{PLR} = N$, and $\mathrm{PL} = 1 + K + N$, where $K$ is the number of retransmissions.
Consequently, the numbers of packets sent, received, and sent-and-received by the server are

$$\mathrm{PS} = \mathrm{PCS} + \sum_{k \ge 0} \mathrm{PS}_k + \mathrm{PLS}, \quad \mathrm{PR} = \mathrm{PCR} + \sum_{k \ge 0} \mathrm{PR}_k + \mathrm{PLR}, \quad \mathrm{P} = \mathrm{PC} + \sum_{k \ge 0} \mathrm{P}_k + \mathrm{PL}. \qquad (7)$$

As the CPU system time used in an operating system can be considered to be defined only
by the numbers of sent and received packets irrespective of the type of packets, processing load
in the operating system is approximated by the following expression.

$$\text{CPU system time} = a_1\,\mathrm{PS} + a_2\,\mathrm{PR} + a_3, \qquad (8)$$

where the coefficients $a_1$, $a_2$, and $a_3$ are determined based on the values measured when executing
the protocol in a real environment. The values are estimated in the next section. On the other
hand, CPU user time consumed by the user process is considered to depend on the type of
packets and cannot be approximated simply by the number of packets processed.
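
The packet counts of equation (5) and the linear load model of equation (8) can be combined in a few lines. This is a sketch under assumptions: the function names are hypothetical, V(x) is again supplied by the caller as `v_of`, and the coefficients are plain parameters with no defaults, since their values have to come from measurement.

```python
def server_packets(m_k, n_k, e, v_of):
    """Equation (5): packets sent and received by the server in round k."""
    n1 = e * n_k + (1.0 - e) * v_of(n_k)      # N'_k
    over = v_of(n_k) + v_of(n1)
    sent = m_k + e * n_k + over               # PS_k
    received = n_k + (1.0 - e) * over         # PR_k
    return sent, received

def cpu_system_time(ps, pr, a1, a2, a3):
    """Equation (8): linear model of CPU system time."""
    return a1 * ps + a2 * pr + a3
```

For one round with 2000 data packets, 100 receivers, a 1% loss rate and no buffer overflow, `server_packets` returns 2001 packets sent (the data plus one expected POLL) and 100 received.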

4.2 Experimental evaluations and comparison to the model analysis

(1) Implementation 

We implemented RMTP on Sun workstations SS4/20, SS4/2, etc. with Solaris 2.3 OS in NTT 
Labs, and IBM RS6000 with AIX 4.1 OS in IBM TRL. Both OSs support IP-multicast and 
the programs were implemented as application processes on UNIX socket interfaces. 

Two RMTP program source codes were independently developed in the two
research organizations based on the RMTP protocol specification collaboratively developed by
them. The two teams jointly conducted interconnection tests in a LAN environment and
confirmed the interoperability of the protocol in September 1995.

The following performance tests were executed on the implementation in Sun WSs. The 
network environment was a 10 Mbps Ethernet LAN with three subnetworks connected by a 
router; IP multicast was supported. The tests did not include the effect of BUSY in the same 
way as the analysis in Section 4.1. 

(2) Performance tests for large-scale information delivery 

Two different emulators were used, Mars and Lares, in order to create a large scale 
communication environment consisting of hundreds and thousands of receivers. 

(a) Full-specification medium-scale emulator (Mars) 

Mars produces an environment of hundreds of receivers and high delay networks. Mars 
consists of multiple RMTP receiver processes. Mars can produce specified artificial packet 
loss and specified artificial delay. As Mars supports many full specification RMTP receiver 
processes in one workstation, the receiver buffer overflows when a relatively high transmission 
rate is set at the server. 

(b) Limited-specification large-scale emulator (Lares) 

Lares produces an environment of thousands of receivers by limiting the protocol procedures.
Lares is an RMTP receiver process which emulates thousands of receivers and always responds
positively with CACK, ACK, and RACK. An ACK arrival process based on the backoff
time algorithm can be created by setting an inter-packet gap in the responding Lares process.

[Experimental results and comparison to the model analysis] 

By combining Mars and Lares we generated environments of differing receiver scale, packet
loss rates, and network delays. The following results are based on a 1% packet loss in Mars,
obtained by specifying the loss rate and by setting the transmission rate so as to achieve good
transfer performance. The round-trip delay (turn around time) is 100 msec (assuming two
ATM switches and two routers over 3000 km optical networks) or 600 msec (assuming
two satellite links or a public packet-switched network in use). A UDP socket buffer of 50 Kbytes
and a UDP packet size of 1 Kbyte are used. The backoff time is used for cases of over 1000
receivers, with $\delta$ values from 2 to 4 ms. Data size was always 2 Mbytes.

Figures 4 and 5 show the transfer time and processing load as functions of the number of
receivers. In the case of about ten receivers, only real receivers were used. Seven workstations
running Mars, each emulating ten to twenty receivers, were used to emulate 100 receivers
for the 100 to 5000 receiver cases. Mars was also used for the 50 receiver case. Two to ten
workstations running Lares, each emulating 75 to 1000 receivers, were used for the 500 to
10,000 receiver cases. Tests were run several times for each case, and the graphs in the figures
show the mean values.

Figure 4 shows that RMTP achieves acceptable transfer time for practical applications.
For example, 2 Mbytes (one newspaper) can be delivered to 5000 users within 3 minutes.
Furthermore, the curve increases linearly with the number of receivers beyond 5000 receivers.
This shows the limitation of the backoff time algorithm for over ten thousand receivers.

The transfer time obtained by equation (4) is depicted as theoretical values in Figure 4, with
the timer surplus time $\alpha$ set to 0 to show optimal values. The fact that the surplus time was set
large enough to accept delayed responses explains the difference between observed values and
theoretical values. However, the delay does not strongly impact the transfer time or CPU
processing load and is neglected in the graph.

For comparison, we tested the repeated use of four parallel FTP with no artificial packet 
loss in the same LAN environment. The parallel FTP processes consumed the full 10 Mbps 
bandwidth of the Ethernet LAN while RMTP used only some bandwidth depending on the 
selected transmission rates. Only 12 % of bandwidth was used for data transmission from the 
server for over 100 receivers using the fixed rate of 0.8 Mbps. 

In additional tests of 2 Mbyte data delivery to 100 emulated receivers, the 10% packet error
rate case required 68 sec with five retransmissions, while the 0% and 1% cases required 22 sec with
no retransmission and 40 sec with two retransmissions, respectively.

As for packet processing load at the server, the same characteristics can be observed in
Figure 5. Here, for the CPU system time given by equation (8), the coefficients
$a_1$, $a_2$, and $a_3$ can be determined from the test results. The resulting approximation is
CPU system time $= 7.0 \times 10^{-5}\,\mathrm{PS} + 2.3 \times 10^{-4}\,\mathrm{PR} + 0.1$. The coefficients show
that receiving a packet incurs about three times the CPU load of sending a packet.

Figure 4 Transfer time in the server vs delivery scale.

Figure 5 CPU load in the server vs delivery scale.

Lastly, the impact of backoff time on transfer time is shown in Figure 6. The curves
show that the backoff time needs to be small for large numbers of receivers to ensure high
transfer time efficiency, while backoff times that are too short cause drastic overhead from
timeouts spent waiting for lost packets. UDP buffer overflow in the server is also observed with
small backoff times in the 5000 and 8000 receiver cases, even if a large enough buffer (8533
packets) is used. This may be caused by excessive buffer management.



4.3 Experiences from rate control 



A performance test of the monitor-based rate control scheme was executed in the same
network environment. Before examining the adaptive rate control, the effect of transmission
rate on communication performance was examined. Figure 7 shows the transfer
time ratio versus transmission rate at the server, observed in the tests of the previous section where
fixed rates were used. The curves show that the transmission rates must be carefully
controlled to achieve good performance under unstable network or receiver conditions.

The rate control tests included the case of RMTP with the proposed adaptive rate control
and the case of RMTP with optimally selected fixed rates. Two criteria of adaptive rate
control were adopted: (1) the transmission rate is increased or decreased depending on the
number of receivers in which the number of lost data packets exceeds a specified value, and
(2) the rate is decreased depending on the number of receivers which do not respond within a
specified time. The tests were executed for 10 Mbyte data delivery to seven real receivers and to
100 receivers emulated by seven workstations running Mars.

The frequency of the transfer time ratios observed over 20 trials is shown as a
histogram in Figure 8. Both the 100 and seven receiver cases show that the transfer time
is improved by applying the proposed rate control. At the same time, the variance of the transfer
time becomes small when the rate control is used. Thus, the monitor-based rate control scheme
improves the transfer performance and contributes to the stability of reliable multicast. This is
accomplished by maximizing the transmission rate while avoiding unnecessary overflow.

Additional tests including a 600 msec delay did not show much impact on transfer time, in the
same way as the tests in Section 4.2.

Figure 7 Transfer time vs transmission rate.

5. CONCLUDING REMARKS

The error recovery procedure of the Reliable Multicast Transport Protocol (RMTP), which has been
proposed for delivering large amounts of data to large numbers of users, was evaluated through
analyses and performance tests in a LAN environment including large-scale receiver emulators.
The tests confirmed the feasibility of the proposed retransmission procedure in realizing large scale
information delivery to thousands of receivers. Against the ACK implosion caused by
end-to-end responses, the backoff time algorithm was applied, and test results showed that the
algorithm is applicable up to about ten thousand receivers in practical use.

As for flow control for reliable multicast, the monitor-based rate control was additionally
applied to the protocol for global performance degradation and showed enhanced performance
and stability regarding transfer time. Separate retransmission, in which some
performance-degraded receivers are detected and served by retransmission after suspension, was
also realized in the protocol. However, the criteria for applying separate retransmission and
their effectiveness are still under evaluation.

As the user number increases, security issues such as data confidentiality and user 
authentication, and network administration issues such as multicast routing will also become 
important for practical multicast services. RMTP is going to be used in corporate information 
delivery systems first. Multicast administration issues are to be studied through field tests. 

Appendix A Assessment of basic multicast retransmission procedure

This appendix assesses how many rounds of retransmission are needed until the basic
multicast retransmission procedure proposed in Section 3.1 completes all data retransmission,
i.e., the whole data is correctly received by the receivers. The error rate is assumed to include
all transmission errors and packet loss throughout the network from the sending end to the
receiving end for each packet.

[Definitions and assumptions] 

$N$: number of receivers

$M$: message size (number of packets forming the whole data set)

$e$: packet error (loss) rate (PER) for a packet multicast through the whole network and
received by a receiver. PER is considered as the error rate in the network by assuming packets
are not lost in the receiver if the transfer speed is carefully chosen.

$S_k$: number of packets to be confirmed by the receivers in the $k$th round of data transmission,
that is, the number of packets that have not yet been received by the receivers just before the
$k$th round transmission

$M_k$: message size of the $k$th data retransmission (number of packets) ($M_0 = M$)

$N_k$: number of receivers waiting for data retransmission in the $k$th round ($N_0 = N$)

$S_k$ is expressed as follows.

$$S_0 = M N, \qquad S_k = e^k M N \qquad (a1)$$

$M_k$ and $N_k$ are obtained as follows.

$$
\begin{aligned}
M_1 &= \{1 - (1-e)^{N}\}\,M, & N_1 &= \{1 - (1-e)^{M}\}\,N, \\
M_2 &= \{1 - (1-e)^{S_1/M_1}\}\,M_1, & N_2 &= \{1 - (1-e)^{S_1/N_1}\}\,N_1, \\
M_k &= \{1 - (1-e)^{S_{k-1}/M_{k-1}}\}\,M_{k-1}, & N_k &= \{1 - (1-e)^{S_{k-1}/N_{k-1}}\}\,N_{k-1} \qquad (a2)
\end{aligned}
$$

where $S_n/M_n$ is the mean number of receivers and $S_n/N_n$ is the mean message size in the $n$th round;
$(1-e)^x$ is the probability that, for a packet, all $x$ copies made by IP multicast ($x = N, S_n/M_n$:
number of receivers) are received without any packet loss, or the probability that, for a receiver,
all $x$ packets ($x = M, S_n/N_n$: message size) are received without any packet loss; and
$1-(1-e)^x$ is the probability that packet loss occurs at some of the $x$ receivers to which the packet
needs to be delivered, that is, the packet needs to be retransmitted in the next round,
or the probability that packet loss occurs in some of the $x$ packets which must be received by the
receiver, that is, the receiver needs to be served in the next round.

The set $(S_k, M_k, N_k)$ is calculated by the above equations as in the following example.

Example:

Packet error rates (PER) $e = 10^{-2}$ to $10^{-6}$, which correspond to bit error rates (BER) of
$1.22 \times 10^{-6}$ to $1.22 \times 10^{-10}$ when a 1 Kbyte packet is used, since PER = 8,000 x BER
assuming errors occur randomly for any bit. $M = 2000$ packets, $N = 5000$ receivers.

Table A Retransmission number assessment

k \ e   10^-6                 10^-4                 10^-2
0       (10^7, 2000, 5000)    (10^7, 2000, 5000)    (10^7, 2000, 5000)
1       (10, 10, 10)          (10^3, 906, 787)      (10^5, 2000, 5000)
2                             (0.1, 0.1, 0.1)       (10^3, 906, 787)
3                                                   (10, 10, 10)
4                                                   (0.1, 0.1, 0.1)

The values of blanks are all smaller than 0.01.



The retransmission numbers can be read as follows.

For PER $10^{-6}$, multicast is considered to complete with the 1st round of retransmission.
For PER $10^{-4}$, multicast is considered to complete with the 1st or 2nd round of retransmission.
For PER $10^{-2}$, multicast is considered to complete with the 3rd or 4th round of retransmission.

In the same way, we can estimate the retransmission numbers for higher error rate cases:
7 retransmissions for $e$ = 10% and 13 retransmissions for 30%, where $M = 2000$ and $N = 5000$.
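
The recursion in equations (a1) and (a2) can be iterated directly. The following is a minimal sketch, with a hypothetical function name and two assumed parameters: `max_rounds` bounds the loop, and `done` stops it once the number of outstanding packets falls below 0.01, the same threshold the table uses for blank entries.

```python
def retransmission_rounds(m, n, e, max_rounds=50, done=0.01):
    """Iterate equations (a1) and (a2).

    Returns a list of (S_k, M_k, N_k) tuples for k = 0, 1, 2, ...
    and stops after the first round in which S_k drops below `done`.
    """
    rows = [(float(m * n), float(m), float(n))]
    s, mk, nk = rows[0]
    for k in range(1, max_rounds + 1):
        if mk <= 0 or nk <= 0:
            break
        # (a2): S_{k-1}/M_{k-1} receivers per packet, S_{k-1}/N_{k-1} packets per receiver
        new_m = (1.0 - (1.0 - e) ** (s / mk)) * mk
        new_n = (1.0 - (1.0 - e) ** (s / nk)) * nk
        s = (e ** k) * m * n          # (a1)
        mk, nk = new_m, new_n
        rows.append((s, mk, nk))
        if s < done:
            break
    return rows

if __name__ == "__main__":
    for row in retransmission_rounds(2000, 5000, 1e-4):
        print(row)
```

For $M = 2000$, $N = 5000$ and $e = 10^{-4}$, the first retransmission round comes out as $S_1 = 10^3$ outstanding packets, about 787 distinct packets to resend and about 906 receivers still waiting.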

Abstract

Integrated Layer Processing (ILP) has been presented as an implementation technique to improve 
communication protocol performance by reducing the number of memory references. Previous 
research has however not pointed out that in some circumstances ILP can significantly increase 
the number of memory references, resulting in lower communication throughput. 

We explore the performance effects of applying ILP to data manipulation functions with 
varying characteristics. The functions are generated from a set of parameters including input 
and output block size, state size and number of instructions. We present experimental data for 
varying function state sizes, number of integrated functions and instruction counts. 

The results clearly show that the aggregated state of the functions must fit in registers for ILP 
to be competitive. 



Keywords 

ILP, Integrated Layer Processing, performance, protocol implementation 

1 INTRODUCTION

Data manipulation in software is a major bottleneck for communication performance, because 
it involves memory accesses to every byte of the message data (Druschel, Abbott, Pagels and 
Peterson, 1993; Smith and Traw, 1993). Since the speed of CPUs is increasing at a faster rate than 
the latency of memory systems is decreasing, this bottleneck is becoming more severe (Patterson 
and Hennessy, 1994; Wulf and McKee, 1995). 

Data manipulation functions all have in common that they read message data, perform some
computation that transforms the data, and possibly write new message data. Examples of data
manipulation functions are checksum calculation, presentation encoding and encryption.

Integrated Layer Processing, ILP, is a protocol implementation technique introduced by Clark
and Tennenhouse (1990). The purpose of ILP is to reduce the number of memory accesses
needed by the data manipulation functions. The potential reduction comes from combining the
manipulation functions of several protocol layers into a pipeline in a single processing loop.
For each iteration of the loop the manipulations are run in succession on a small quantity of
data, typically a word or a couple of words. Message data is pipelined via registers between the
manipulations and need not be accessed from memory by each function.
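
The difference between a sequential and an integrated implementation can be sketched with two toy manipulation functions. This is an illustration only, with hypothetical function names: a 16-bit byte swap followed by a simple modular sum (not the Internet one's-complement checksum), and Python does not expose registers, so the sketch shows the loop structure rather than the memory behavior.

```python
def sequential(words):
    """Two passes: each function walks the whole message."""
    # Pass 1: byte swap every 16-bit word and store the result.
    swapped = [((w & 0xFF) << 8) | (w >> 8) for w in words]
    # Pass 2: read the stored result again to compute the sum.
    checksum = 0
    for w in swapped:
        checksum = (checksum + w) & 0xFFFF
    return swapped, checksum

def integrated(words):
    """One pass: both manipulations run on each word in turn."""
    out, checksum = [], 0
    for w in words:
        s = ((w & 0xFF) << 8) | (w >> 8)      # byte swap
        checksum = (checksum + s) & 0xFFFF    # sum the value just produced
        out.append(s)
    return out, checksum
```

Both versions return the same output; the integrated loop touches each input word once, while the sequential version reads the intermediate buffer a second time.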

ILP has been shown to considerably improve throughput for simple functions (Abbott and 
Peterson, 1993; Gunningberg, Partridge, Sirotkin and Victor, 1991; Partridge and Pink, 1993) 
and to improve throughput less for more complicated functions and when used in complete 
systems (Braun and Diot, 1995; Gunningberg, Partridge, Sirotkin and Victor, 1991). 

We will show that, depending on the characteristics of the manipulation functions, an integrated
implementation can perform better than or much worse than a corresponding sequential
implementation. The key question is: When does an integrated implementation actually result
in fewer memory accesses compared to a sequential implementation? Part of the answer is that
there is more to consider than the memory accesses to the message data. The CPU registers
clearly play an important role. If the result of the integration is that one of the manipulation
functions can fit less of its frequently needed information in registers, there will be extra memory
accesses for this information. These accesses have to be weighed against the saved memory
accesses to message data.

We use an experimental method to investigate how data manipulation functions with varying 
demands for CPU registers respond to integration. We generate ‘synthetic’ data manipulation 
functions from a set of parameters, such as state size, input and output block size, and the 
number of arithmetic instructions. The method gives us complete control over the selection of 
parameters and thus over the behavior of the manipulation functions. 

The main contribution of this paper is to show that an integrated implementation can perform
significantly worse than a sequential implementation when the CPU registers are too few to hold the
aggregated state from all functions.

The rest of the paper is organized as follows. The next section describes data manipulation 
functions and ILP in detail and defines terminology used in later sections. Readers familiar with 
the topic may want to skip directly to Section 3, where we describe the experimental method. 
Section 4 presents the results of the experiments and discusses the factors affecting the relative 
performance of integrated implementations compared to sequential implementations. Section 5 
discusses the generalization of the results to computer systems in general and the last section 
contains the conclusions. 

2 DATA MANIPULATION FUNCTIONS AND ILP 

Protocol processing can be divided into two parts: protocol control and data manipulation. 
The control part processes protocol headers and controls the state of a connection. The data 
manipulation functions operate on the user data part of a message. The manipulation function 
processes blocks of data from a message input buffer until it is empty. The transformed data is 
written to an output message buffer. The size of the input block depends on the function, and 
can vary considerably. The TCP/IP checksum uses two bytes, a byte swap function uses a word, 
an XDR encoder uses multiples of four bytes, an MPEG encoder uses 256 bytes, etc. 

The pseudo-code below illustrates a data manipulation loop. The loop successively 
reads input blocks of data from the input buffer until it reaches the end of the buffer. 

for (i = 0; i < input_buffer_size; i += input_block_size) { 
    read(input_block) 
    /* MANIPULATION_INSTRUCTIONS */ 
    write(output_block) 
} 

The performance of a data manipulation function depends on the access time to input and 
output buffers as well as on the processing time for the manipulation instructions. On a modern 
RISC processor there is a large delay penalty when data has to be fetched from primary memory. 
This penalty can be substantial for simple data manipulation functions. It is therefore important 
that frequently used variables and data structures in the manipulation are put in registers by the 
compiler. 

A data manipulation can be described by: 

• An input data block: The size of the input block is fundamental to the function. The whole 
block must be present in order to do the processing. Message data is most likely found in 
primary memory unless it has been recently accessed. The input block is loaded from primary 
memory through the cache to the registers for processing. 

• An output data block: The manipulation may expand or compress the input data block or 
produce an output block of the same size, depending on the particular function. The block 
is normally stored to the output buffer at the end of a manipulation. Some functions, like 
checksums, do not have an output data block. 

• A state: Some manipulation functions use the state from previous iterations or some initialized 
context. This is the case for the TCP/IP checksum and the DES encryption algorithm in CBC 
mode. The checksum algorithm adds a 16 bit data block to the accumulated sum of previous 
blocks, i.e., the sum is the state. Since the state variables are used for each input block it is 
desirable that they stay in registers from the processing of one block to the next. The state 
must be carried over to other messages when the data manipulation needs several messages 
to finish. 

• Table look-ups: One difference between a table and the state is that a table is not updated. A 
table is expected to be large and to have relatively few accesses to it. It is therefore likely that 
most of the table will stay in primary memory and only the cache lines which are accessed 
by table lookups will be brought into cache. These lines will however evict other cache lines 
in an unpredictable way, which will affect performance. If these lines are not accessed in a 
long time they will eventually be evicted. 

• Constants: Constants are explicit in the code. The difference we make between a constant 
and a table entry is that the table is sparsely accessed and too large to fit into registers. The 
constants are accessed each time the manipulation is performed. They are therefore candidates 
for register allocation by the compiler. 

• Temporaries: A data manipulation function will also use a number of temporaries which are 
not visible outside the function. Depending on their usage, a compiler may assign them to 
registers. 

• Instructions: A data manipulation function has both control instructions and arithmetic-logic 
instructions. When referring to the number of instructions in a data manipulation in this paper 
we mean the number of arithmetic-logic instructions. The code for a manipulation loop is 
typically small enough to fit into cache. For unified caches, some instructions may be thrown 
out due to conflicts with data, such as table access data, which will prolong the processing 
time. Function calls in the data manipulation code reduce locality and increase the risk of 
address conflicts in the cache, see (Mosberger, Peterson and O’Malley, 1995) for a discussion 
on this effect. 

2.1 Integrated Layer Processing 

The basic idea behind ILP is to perform all the manipulations in one or two processing loops, 
instead of performing several manipulation loops sequentially as is most often done today. By 
an ILP loop we mean a loop that consists of an input data block read from memory, followed 
by a ‘pipeline’ of several manipulations and an output data block write. In such a pipeline each 
manipulation successively transforms this block of data until the last manipulation is applied. 
Thereafter the fully transformed block is written back. 

The rationale for this loop integration is that the number of time consuming memory accesses 
to input and output data buffers in the pipeline can be reduced. Instead, these accesses will be to 
the transformed block which is assumed to be located in registers. This is achieved at the cost of 
a larger aggregated state since all manipulation function states must be kept for the next input 
block to the pipeline. 

For design of and implementation details on ILP loops we refer to Braun and Diot (1995), 
Abbott and Peterson (1993) and to Gunningberg et al (1991). 

An ILP loop is illustrated below. Assume n manipulations to be integrated in the loop. They 
are sequentially ordered, such that manipulation i is executed before manipulation i + 1. 

for (i = 0; i < input_buffer_size; i += input_block_size) { 
    read(input_block) 
    /* MANIPULATION_INSTRUCTIONS from function 1 */ 
    /* MANIPULATION_INSTRUCTIONS from function 2 */ 
    /* ... */ 
    /* MANIPULATION_INSTRUCTIONS from function n */ 
    write(output_block) 
} 

For very simple functions with few instructions, such as the TCP/IP checksum, byte alignment, 
etc, the time to read and write to memory dominates the execution time. For other functions, 
like encryption and some presentation encodings, the manipulation time may dominate. The 
relative speed-up with an ILP loop for simple functions can be substantial while it is marginal 
for complex functions. 



3 EXPERIMENTAL METHOD 

We use the time (in nanoseconds) to process a byte as the performance measure of a data 
manipulation function. This measure depends on the number of instructions executed and the 
average number of clock cycles needed per instruction. The number of cycles depends not only 
on the instruction, but also on parallelism and data dependencies in the instruction pipeline, and 
how many cycles the CPU has to wait when referenced data is in cache or in main memory. It is 
the job of the compiler to: (1) reduce the number of instructions, (2) schedule the instructions to 
avoid a stalled pipeline, (3) allocate registers in the most efficient way and (4) generate code 
and data addresses with good locality. A skilled programmer can certainly structure code and 
data to lower the number of cache misses and to use registers efficiently. 

3.1 Generating data manipulation functions 

To be able to study data manipulation functions with varying characteristics we want to au- 
tomatically generate the desired functions from a set of parameters, such as state size, block 
size and count of arithmetic instructions (described in the previous section). We therefore have 
implemented a generator program that takes such a set of parameters as input. The generator 
program produces ‘synthetic’ code which behaves according to the parameters, but does not 
do anything else useful. We actually implemented two generator programs, one producing an 
integrated implementation and the other producing a sequential implementation from the same 
set of parameters. 

The generator programs produce C code to make the method more portable to different 
architectures. The drawback is that much care must be taken to make sure that the particular C 
compiler produces the desired machine code. When generating code that does not do anything 
useful, it is easy to accidentally produce ‘dead code’ that is optimized away by the compiler. 
The strategy of the generator programs is to minimize the use of temporary variables and make 
sure that the output is dependent on all input and on the result of all arithmetic instructions. 

To validate that the generator programs produce code that is comparable to real data ma- 
nipulation functions with respect to throughput, we have made a comparison with the BSWAP/ 
PES/CKSUM combination of functions from Abbott and Peterson (1993). The parameters of the 
BSWAP/PES/CKSUM functions were used as input to the generator programs and the result was 
compared to the real implementations (see Table 1) on our two measurement systems, the HP 
9000-735/90 and the SPARCstation 2. On the HP, the synthetic implementations performed 
virtually identically to the corresponding real implementations. On the SPARCstation there was a 
small discrepancy, which we traced to the C compiler making a slightly different 
register allocation. 

Table 1 Performance comparison of synthetic and real implementations of the BSWAP/PES/ 
CKSUM loop (throughput in Mbit/s). 

                    cold cache        warm cache 
                    Seq.     ILP      Seq.     ILP 
HP real             83.9   163.8     147.3   217.7 
HP synthetic        83.9   163.8     147.3   218.5 
SS-2 real           23.2    46.0      34.8    64.3 
SS-2 synthetic      22.8    43.0      33.9    58.3 

3.2 Measurement method 

We have implemented a measurement program whose core is a loop, or a set of loops, in which 
different code easily can be inserted, in particular the parameterized synthetic data manipulation 
code described above. In the integrated case, there is a single loop. In the sequential case, there is 
a set of loops, one loop per function. The integrated loop, or each sequential loop, runs once over 
an input buffer and possibly produces data in an output buffer. The program measures the time 
over single invocations of a loop using the standard gettimeofday() system call. In both 
our measurement systems, SunOS 4.1.x and HP-UX Rel. 9, this call has a 1 µs resolution. The 
results presented below are median values of 1000 measurements corrected for measurement 
overhead. The median value is often only one or two microseconds larger than the minimum 
measured value. The maximum value can, however, be almost any number because of interrupts 
and scheduling of other processes. The mean value is therefore less interesting. The buffer size 
is chosen so the measurement is approximately in the range 500 µs to 5 ms. The lower value 
is large enough to give accurate correction for measurement overhead, and the upper value is small 
enough that a run can complete without being disturbed by interrupts. 

The program gives us complete control over how the input and output buffers are allocated 
and located in the cache. In the measurements in this paper, all buffers, the stack and global 
variables are allocated in such a way that they do not conflict in the cache, i.e., so they do not 
compete for the same cache location. For the SPARCstation 2, which has a unified instruction 
and data cache, we have also made sure that the executed code, both in the measurement program 
and in the gettimeofday() system call, does not conflict with the buffers, the stack or the 
global variables. 

The program also controls the cache temperature of the buffers. If ‘warm’ is selected for a 
buffer, the program reads the buffer before the measurement loop to load the buffer into the 
cache. If ‘cold’ is selected, it flushes the buffer from the cache. In the measurements below, 
‘warm cache’ means that all buffers are warm, and ‘cold cache’ means that all buffers are cold 
and, in the sequential case, that the buffers also are flushed from the cache between each loop. 

We experiment with four cases: an ILP loop and sequential loops with warm and cold caches 
respectively. In a real implementation the cases with warm cache correspond to a situation where 
input data has already been touched, for example by an application, and the output buffer is already 
in cache, for example because it is a buffer from a previous run of the loop which is reused. 
Furthermore, in the sequential case, the next manipulation function will use this warm buffer as 
its input buffer. This is an ideal, best case situation, where none of the output buffers is evicted 
before it is accessed by the next function. It may only be achievable when manipulation loops 
are run directly after each other. 

In the cases with cold cache, input data to the manipulation loop has to be fetched from 
primary memory. In the sequential case all output buffers are also evicted from the cache before 
the next manipulation function is run. This could be the situation in a real implementation when 
the data manipulation functions are separated in the protocol stack. In most implementations we 
expect that some cache blocks are evicted and some stay in cache. The performance will then be 
somewhere in between the two extremes. By comparing these cases, we can estimate the effects 
of caching on performance. 

3.3 Dependency on the compiler 

We discovered, not surprisingly, that the measurement results are very dependent on the particular 
compiler used. On both our platforms, SunOS and HP-UX, we used the stock ‘cc’ compiler. We 
carefully checked the code produced by the compilers in order to make sure that it was what we 
expected. 

The two compilers each have their peculiarities. The SunOS compiler is not very good at 
register allocation. This results in more variables than necessary allocated on the stack and code 
with additional loads and stores of these variables. This shows up in the measurements as a 
performance decrease when there are many local variables. The HP compiler on the other hand 
sometimes decides to slightly unroll loops by duplicating the loop body two or three times. This 
usually shows up in the measurements as a small performance increase. 

When these features of the compilers affect the measurements described in the next section, 
it is pointed out and the effects are discussed. 



4 FACTORS AFFECTING ILP PERFORMANCE 

We have studied three factors that affect the relative performance of an ILP implementation 
compared to a sequential implementation. The experiments and results for each of the factors 
are presented in the following subsections. 

The first factor is the number of instructions per data manipulation function. The results are 
as expected and show that the number of arithmetic instructions per function does not affect 
the absolute performance difference. 

The other two factors are function state size and the number of functions. Both factors affect 
the number of CPU registers that are used to hold the state needed in the loops. The results clearly 
show that this has a large impact on performance. 

4.1 Instructions 

In this experiment we vary the number of arithmetic instructions for each data manipulation 
function and observe the processing time per byte. The other parameters are kept constant. The 
input block size is set to 1, the number of state variables is set to 2 and the number of functions 
is set to 3. The aggregate state of the functions is small enough to fit into registers. 

We vary the number of arithmetic instructions in each function, from 4 up to 100 instructions. 
For simplicity, we use the same number of instructions in each function. We have measured ILP 
loops as well as sequential loops for both warm and cold caches. 

We expect the processing cost to be linear in the number of arithmetic instructions in the 
loops, since these arithmetic instructions only add to the CPU execution time, and do not affect 
the number of memory accesses. The code for the loops fits in the cache and is expected to 
stay there after an initial load. 

In Figures 1 and 2 the measurement results for SPARCstation 2 and HP 9000-735/90 are 
presented. The unit on the X-axis is the number of arithmetic instructions in each function. The 
unit on the Y-axis is the processing cost measured in ns/byte. 

Figure 1 The processing cost of increasing number of arithmetic instructions per function 
(SPARCstation 2). 

Figure 2 The processing cost of increasing number of arithmetic instructions per function (HP 
9000-735/90). 

The results are generally what can be expected. The processing time for a byte basically scales 
linearly with the increasing number of arithmetic instructions needed for the manipulation. Both 
systems exhibit small deviations from the linear behavior. This is mainly due to the low-level 
instruction scheduling in the processor, and how well the compiler can take advantage of this 
scheduling. 

In each figure there are four plots for the cases ILP loop with cold and warm cache and 
sequential loops with cold and warm cache. The plots are parallel as expected, with better 
performance for warm cache than cold. The ILP loops always perform better than the sequential 
loops for the measured selection of parameters. 

The difference in processing cost between the ILP loop and the sequential loops should be: 
(1) the absolute gain for ILP when loads and stores to memory are reduced, and (2) the loop 
overhead avoided by ILP by having one single loop instead of one per function. For three 
functions, two loads and two stores can be avoided using ILP. The net gain will be higher for 
cold cache than for warm cache, since a cold load from main memory is more expensive than a 
load from cache. 

The ILP gain for the SS2 is about 220 ns/byte for cold cache (135 ns/byte for warm) and 
for the HP about 30 ns/byte for cold cache (20 ns/byte for warm). This gain is substantial for 
an ILP loop with a small number of arithmetic instructions. For example, for ten instructions 
the throughput gain for ILP is 65% for the SS2 and 60% for the HP. For large manipulation 
functions, at the other end of the spectrum, the gain is more marginal. For example, for 100 
instructions the gain is 11% for the SS2 and 4% for the HP. 

The benefit of warm caches can also be observed in the figures. The SS2 benefits more from 
a warm cache than does the HP. The sequential case gains more from a warm cache, as expected 
since there are two more loads and stores for every input block as compared to the ILP case. 

For ILP on the SS2 the gain from warm cache is about 44 ns/byte. On the HP the penalty 
for accessing the input block in primary memory is surprisingly small for a high performance 
RISC computer. It varies between 10 and 15 ns/byte for the ILP case and 25 to 40 ns/byte for 
the sequential case. The penalty is slightly higher for functions with few arithmetic instructions. 
This is probably caused by a stalled instruction pipeline. For few arithmetic instructions, the 
compiler has difficulty scheduling the loads and stores in an optimal way. That is not the case 
for ILP, because the compiler then has more arithmetic instructions to schedule between the load 
and store instructions. 

The first conclusion from this experiment is that the absolute gain with ILP is not affected by 
the number of arithmetic instructions. The second conclusion is that the cost of a cold cache is 
constant, and that for the HP, this cost is small. 

4.2 State 

The state of a data manipulation function is the information in the calculation that needs to be 
retained between each iteration. For example, the state of a checksum function is the partially 
calculated checksum. 

In this experiment we vary the size of the state to see how it affects throughput while keeping 
other parameters constant. The values of the constant parameters are: 3 functions, 10 arithmetic 
instructions per function and a 1 word block size. 

Figure 3 The processing cost of increasing state size (SPARCstation 2). 

The expected behavior is that when the state size is increased over some threshold defined by 
the number of available CPU registers, parts of the state must be stored in memory. Thus, the 
cost of having additional state will increase, i.e., the throughput will decrease. It is worth noting 
that the state stored in memory will be allocated on the stack and is most likely to be cached 
since it is accessed at least once for each block. In our experiments we know that the state will 
be cached, since we control the placement of code, data and stack to avoid cache conflicts. 

The integrated implementation with 3 functions has an aggregated state which is 3 times that of 
the sequential implementation. The register threshold will therefore be reached at a per-function 
state size that is 1/3 of the threshold state size for the sequential implementation. For each word 
by which the state of each function grows beyond this threshold, the integrated implementation will 
need to load and store 3 extra words from/to memory, 1 per function. 

When state sizes are further increased, eventually also the states of the sequential functions 
will not fit in registers. After this point, the additional cost for further increased state sizes 
will be the same for ILP as for the sequential implementation (1 extra load and store for each 
additional word of state). This is however not shown in the measurements below, because when 
this happens, ILP already behaves so much worse than a sequential implementation that ILP is 
not a viable implementation alternative. 

As an illustrative example, assume a system with 9 available registers and 3 functions to 
integrate. If each function has its state size increased from 9 words to 10 words, this adds a total 
of 3 loads and 3 stores for each data block to a sequential implementation as well as to an ILP 
implementation. However, while the sequential implementation can keep the rest of the state for 
each function in registers, ILP already before the increase has to perform 18 loads and 18 stores 
on the aggregated state (27 words) for each data block. 

The experimental results presented in Figures 3 and 4 clearly show the expected behavior. 
The unit on the X-axis is the size in words of the state for each of the three functions. 

Figure 4 The processing cost of increasing state size (HP 9000-735/90). 

For the SPARCstation 2 (Figure 3), the cost in nanoseconds is almost constant for the 
sequential implementations. The cost for the integrated implementations is almost constant 
for state ≤ 2 words. When the state reaches 3 words, the number of available registers is not 
enough to hold the aggregated state. There is actually room for a larger state in the CPU registers, 
but the compiler does not do a very good job with register allocation. The better than expected 
performance for state = 5 is due to the compiler ‘accidentally’ doing a better job allocating the 
registers. The non-optimal register allocation does not, however, affect the general result; it only 
shifts the threshold point. 

The HP 9000-735/90 (Figure 4) shows a similar behavior. The CPU has more available 
registers and the compiler allocates registers better, so the threshold for the integrated imple- 
mentations is higher than for the SPARCstation. The reason for the slight decrease in cost for 
larger state size is that the compiler suddenly thinks it should slightly unroll the loop. 

We can also note that, as in the previous experiment, the effect of warm versus cold cache is 
a constant factor, smaller for ILP than it is for the sequential implementations. 

The first conclusion from this experiment is that the performance of ILP starts to decrease 
rapidly when the aggregate state size does not fit in registers. The second conclusion is that the 
cache temperature is a constant factor. 

4.3 Functions 

The previous experiment showed how state size affects performance for a fixed number of 
integrated functions. This experiment will show how the number of integrated functions affects 
performance for different state sizes. 

We vary the number of functions for a selected set of state sizes while keeping the other 
parameters constant. The values of the constant parameters are: 10 arithmetic instructions per 
function and a 1 word block size. 

We assume that each of the integrated functions has the same state size. Since the factor 
affecting performance for ILP in this experiment is the sum of the function state sizes rather than 
the size of each individual function, these results are for ILP applicable to any set of functions 
where the sum of the state sizes adds up to the same total size (when other function characteristics 
remain the same). For the sequential implementations, the results are applicable to the same sets 
of functions as for ILP above, as long as none of the executed functions has a state size larger 
than the number of available registers. 

The expected behavior in this experiment is that as long as each function, or integrated set 
of functions, can keep its active state in registers, an additional function will introduce to a 
sequential implementation the additional costs of the extra loads and stores needed to access the 
data blocks in the message buffer, the loop overhead for the new function, and the arithmetic 
instructions executed in the added function. For the ILP implementation, the only overhead 
will be the arithmetic instructions executed in the added function. Hence, if ILP can keep its 
aggregate state in registers, ILP will perform better than a sequential implementation when 
adding a new function. 

However, as was discussed in the previous experiment, the aggregate state of an ILP imple- 
mentation is the sum of the states of all the integrated functions. Thus, the aggregated state 
is more likely to grow beyond the register capacity for an ILP implementation than it is for a 
sequential implementation. When this happens, the extra cost for ILP will be an additional load 
and store operation for each item that does not fit in registers. When this cost is larger than 
the additional cost of a sequential implementation (as above), ILP will perform worse than a 
sequential implementation. 

Below we present experimental results for the two measured systems. In order to make the 
figures clearer, we have chosen to include only measurements with warm cache. As in the 
previous experiments, measurements with cold cache show behavior similar to that with warm 
cache, only with larger absolute costs because of the additional cache misses. 

In each figure we show the behavior for four interesting state sizes. These sizes are different 
for each system because of the different number of available registers. For the sequential 
implementations, the performance difference between state sizes is so small that for clarity we 
show only one (the worst behaving) sequential implementation. 

The experimental results presented in Figures 5 and 6 show that the behavior is as expected. 
The unit on the X-axis is the number of integrated functions. 

On the SPARCstation 2, the increase in cost for an additional function is almost linear for the 
sequential implementations. For ILP, the cost is dependent on the total state size reached. At 
some point, the total state size exceeds the number of available registers. After this point, the 
cost for ILP grows faster than before the point. For state sizes larger than 2 per function, ILP 
costs then grow faster than sequential costs. 

The expected growth of ILP costs should be proportional to the state size in each function. 
However, we can again see that the SPARCstation compiler is not very good at allocating 
registers. For example, for ILP state size 5 words, we would expect the tangent of the linearly 
growing cost to cut the sequential cost somewhere between 1 and 2 functions. The failure of 
the compiler to use the available registers optimally adds extra load and store costs to all points 
above 1 function. This effectively moves the plotted line upwards in the figure. 

Figure 5 The processing cost of increasing the number of integrated functions (SPARCstation 2). 

Figure 6 The processing cost of increasing the number of integrated functions (HP 9000-735/90). 

The HP 9000-735/90 also exhibits the expected behavior. For state sizes 4 and above, ILP 
costs grow faster than sequential costs. On the HP, the compiler does better register allocation, 
resulting in more linearly increasing costs. 

The conclusion of this experiment accentuates the results from the previous experiment in 
showing that when the ILP aggregated state cannot fit in registers, ILP performance declines 
even though in these experiments we make sure that the state not in registers remains in the cache 
between iterations. 



5 DISCUSSION 

The results presented in this paper can be generalized to other systems with different charac- 
teristics. The faster the CPU is relative to the speed of the memory system, the higher the 
relative cost of loading and storing data will be, thus accentuating the desire to avoid loads and 
stores. The number of CPU registers available will move the state threshold at which ILP starts 
to behave worse than a sequential implementation. The more available registers, the larger the 
state threshold. 

The cache architecture, i.e., the size, associativity and whether unified or not, will mainly 
affect the probability of cache misses. In the presented experiments we have controlled the 
direct-mapped caches of our measurement systems to avoid conflicts. The results we present are 
thus similar to results from systems with associative caches (where conflicts do not occur). Without 
such control, direct-mapped caches will experience cache conflicts with some probability. Cache 
conflicts would further accentuate the difference between situations where state fits in registers 
and situations where it has to be loaded and stored from memory, since the cost of accessing 
conflicting locations will increase. 



6 CONCLUSIONS 

We have shown that the aggregated state size of the data manipulation functions is a crucial 
factor influencing the performance gain, or loss, for an integrated implementation compared 
to a sequential one. When the aggregated state does not fit in processor registers, the integrated 
implementation quickly becomes much slower than the sequential implementation. 

Complexity in the form of many arithmetic instructions does, however, not affect the absolute 
performance difference between an integrated and a sequential implementation. 

Abstract

ALF (application level framing) and ILP (integrated layer processing) are protocol design and 
implementation concepts applied in high-performance communication architectures, e.g. to 
support multimedia applications. Writing ILP code is rather complex; ILP code
generation tools can therefore significantly reduce the time needed to develop efficient ILP
protocol code. This paper presents a tool that allows automated ILP code generation based on
high-level specifications of user data types. The tool is based on a stub compiler and integrates
ILP loops into (un)marshalling routines automatically.



Keywords 

protocol implementation, automated code generation, high-performance communication, pro- 
tocol architecture 



1 INTRODUCTION 

The performance of processors has been increasing faster than the performance of memory 
during the last few years. The resulting memory bottleneck can be reduced by avoiding mem- 
ory access, e.g. by eliminating copy operations from protocol implementations. Another 
approach is to use on-processor caches, which are almost as fast as processor registers. Data 
and instruction caches are currently in the range of a few kbytes, but it is expected that fast 
caches of future processors will grow into the range of several Mbytes. External cache
memories already have sizes of several Mbytes.

Integrated Layer Processing (ILP) tries to avoid memory access and to use caches to store 
intermediate results of data manipulation processing, e.g. encrypted or decrypted data. ILP is 
an implementation concept that allows all data manipulation steps to be implemented in one
integrated processing loop (ILP loop) (Clark, 1993). Theoretically, an ILP protocol stack
implementation reads once from main memory, keeps the read data within registers or cache
memory, and performs all the data manipulations for several protocol layers within the ILP
loop. Processed data are finally written back to the destination memory. In this ideal case, ILP
requires only one read and one write access to the main memory for each basic processing
unit, which is usually a 32- or 64-bit word.
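
As a minimal sketch of this idea (not code from the paper), the following C fragment contrasts a sequential implementation, which makes one pass over memory per manipulation, with an integrated loop that reads and writes each 32-bit word once. The XOR mask standing in for encryption, the additive checksum, and the function names process_sequential and process_ilp are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential version: each manipulation makes its own pass over memory. */
uint32_t process_sequential(const uint32_t *src, uint32_t *dst, size_t n,
                            uint32_t key)
{
    uint32_t sum = 0;
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)        /* pass 1: toy "encryption" (XOR mask) */
        dst[i] = src[i] ^ key;
    for (i = 0; i < n; i++)        /* pass 2: checksum over the result */
        sum += dst[i];
    return sum;
}

/* Integrated version: one read and one write per 32-bit word. */
uint32_t process_ilp(const uint32_t *src, uint32_t *dst, size_t n,
                     uint32_t key)
{
    uint32_t sum = 0;
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        uint32_t reg = src[i];     /* single read from memory */
        reg ^= key;                /* manipulation 1, on the register */
        sum += reg;                /* manipulation 2, on the same register */
        dst[i] = reg;              /* single write to memory */
    }
    return sum;
}
```

Both functions produce the same output buffer and checksum; the integrated one simply avoids re-reading the intermediate result from memory.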

The Application Level Framing (ALF) concept has been developed together with ILP and
provides several features that enable ILP. ALF means that the application breaks the data into
suitable aggregates called application data units (ADUs), and the lower levels preserve these
frame boundaries as they process the data. Therefore, no segmentation, blocking, or other
size-changing functions are allowed. Another advantage of the ADU being the logical unit
of all protocol functions is that an ADU can be processed largely independently of other ADUs.
This avoids buffering packets in packet queues, as is necessary, for example, for reassembly.
Buffering is required if packets are split into several packets or if several packets have to be
combined into a single one. It has been shown that ALF increases the probability that data are
within the cache while being processed by a pipeline of consecutive protocol functions (Braun,
1996). In particular, with small packet sizes and the expected growth of future on-processor
caches, the probability increases that a packet fits into an on-processor (first-level) cache.

Writing ILP code by hand is not a trivial task. There are several issues increasing the complex- 
ity of ILP code: 

• Different data manipulation functions manipulate application data in units of different
sizes. For example, checksum calculation operates on 16-bit boundaries and encryption
functions usually work on 64-bit boundaries. In this case, 4 basic checksum operations
must follow a basic encryption operation to process the basic processing unit of 64 bits.
An encryption function working on 48 bits and a checksum algorithm working on 32 bits
would result in 96-bit basic processing units.

• ILP code is usually highly optimized in order to achieve efficient code. 

• Integrating code from different protocol layers may violate the implementation modular- 
ity and limit the flexibility. For each combination of data manipulations, new ILP code 
has to be written. For a large number of possible combinations this may be very time con- 
suming. 
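
The unit sizes in the first item above combine through their lowest common multiple. The following sketch, with hypothetical helper names gcd_u, lcm_u and ilp_unit, computes the basic processing unit of a loop from the unit sizes of its manipulations.

```c
#include <assert.h>

/* Greatest common divisor (Euclid's algorithm). */
unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Lowest common multiple of two unit sizes (both in bits or both in bytes). */
unsigned lcm_u(unsigned a, unsigned b)
{
    assert(a > 0 && b > 0);
    return a / gcd_u(a, b) * b;
}

/* Basic processing unit of an ILP loop: LCM of all unit sizes. */
unsigned ilp_unit(const unsigned *sizes, unsigned count)
{
    unsigned unit = 1;
    unsigned i;

    for (i = 0; i < count; i++)
        unit = lcm_u(unit, sizes[i]);
    return unit;
}
```

For the two cases in the text, the sizes {16, 64} give a 64-bit unit (four checksum steps per encryption step) and {48, 32} give a 96-bit unit.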

Automated ILP code generation is an approach for solving some of the problems outlined
above. A simple but limited approach to automated ILP code generation is a macro
preprocessor (Massey, 1993). Another approach, which is the one we present in this paper, is
the extension of a stub compiler to generate the ILP protocol code. This means that the
(un)marshalling routines generated by the stub compiler are extended to integrate the
processing of data manipulation routines.

Generally, protocol compilers automatically generate code from specifications written in
description languages such as SDL or LOTOS. The main problem of automatic code
generation from protocol specifications is the performance, which is often much worse than
that of manually written code. One example of efficient code generation based on formal
specifications is described in (Castelluccia, 1996). That paper also gives a good overview of
related work in automatic protocol generation.

The following section gives an overview of the INRIA stub compiler (ISC) (Braun, 1996),
which was the basic tool used to develop our ILP compiler. Sections 3 to 5 describe the
ISC extensions and the tools developed in order to support automated ILP code generation.
The ISC extensions and tools have been developed in three steps, and each section describes
one of these steps. Section 3 describes the extensions of ISC for ILP loop integration into the
(un)marshalling routines. It shows how ILP loop macros are inserted into the (un)marshalling
functions. These ILP loop macros have to be either written manually or generated by a tool.
Section 4 introduces the tool we developed to generate ILP loop macros automatically. This
tool has also been extended in order to support an advanced ALF/ILP protocol architecture. In
that case, a single procedure is generated for sending and another for receiving an ADU, each
performing (un)marshalling, the other data manipulations, initializations and error checks.
The extensions are discussed in Section 5 and, finally, Section 6 describes the performance
benefits of the developed tool.



2 INRIA STUB COMPILER 



In contrast to other stub compilers, which use ASN.1 or XDR as the input language, the
INRIA stub compiler (ISC) (Hoschka, 1994) uses an annotated C language to
describe data types. ISC generates routines to convert the data into an appropriate network
representation format. Currently, ISC supports the PER and XDR network formats.



The architecture of ISC is shown in Figure 1. The user describes the format of the messages
exchanged by the application in the file IF.h.anno using the C language with additional
annotations. ISC generates three C files: the first file contains type specifications (IF.h), the
second contains the encoding functions to convert messages from their local data
representation into the XDR network format (IF_xdrc.c), and the third contains the decoding
functions to decode a message from the XDR network format into its local data representation
(IF_xdrd.c). In the final step, a C compiler translates the C files into object code. The object
code can be linked to an application in order to provide (un)marshalling functions.



[Figure 1 labels: input files (IF.h.anno); files generated by the stub compiler (IF.h,
IF_xdrc.c, IF_xdrd.c); object files (IF_xdrc.o, IF_xdrd.o)]

Figure 1 The INRIA Stub Compiler (ISC) (Hoschka, 1994)



3 ILP EXTENSIONS TO ISC

3.1 Architecture

The ILP extensions to ISC (ISC/ILP) described hereafter extend the marshalling and
unmarshalling routines by integrating the processing of data manipulations. ILP performs
several data manipulations, such as encryption, decryption or checksum calculation, within a
single loop. With the ILP extensions, ISC performs these data manipulations on the data
together with (un)marshalling.

The data manipulations for sending are performed after marshalling a message; the data
manipulations for receiving are performed before unmarshalling. The extensions only allow
data manipulations that are logically on a lower level than (un)marshalling (Figure 2).
Furthermore, ISC/ILP supports only XDR as the network representation format. XDR has the
characteristic that it is based on 4-byte boundaries. Therefore, the unit of 4 bytes is selected as
the minimum basic processing unit size for all data manipulations. Figure 2 shows the relation
between (un)marshalling and the other ILP data manipulation operations. The processing
steps for data to be sent are the following:

1. The marshalling operation reads the data to be marshalled from the main memory into 
registers or cache memory. 

2. The marshalling operations are performed on the registers and the marshalled data are 
stored again in the registers. 

3. The other data manipulations combined into the ILP loop operate on the registers. 

4. The data is finally written into the destination memory. 

The processing steps for received data are the following: 

1. Received data is read into the registers.

2. The data manipulations are performed on the registers. 

3. Finally, the unmarshalling operation writes the unmarshalled data into the destination 
memory. 

Figure 2 Integration of (un)marshalling routines and data manipulations

ISC/ILP requires as input a description of the data types using an annotated C language. In
Figure 3, the file IF.h.anno contains this data type description. ISC/ILP generates the C files of
the (un)marshalling procedures and adds code fragments that allow ILP
data manipulations during (un)marshalling. In particular, some macro patterns for the ILP loop
are inserted into the (un)marshalling routines.



These ILP loop macros (data_manipulations_send and data_manipulations_receive)
must contain the code for the ILP loop to be integrated. They have to be defined in the include
file IF_ilp.h. This file must be included in the (un)marshalling procedures (IF_xdrc.c and
IF_xdrd.c). Example code for the (un)marshalling procedures is given in Section 3.2. ISC/ILP
also generates an include file IF.h containing the final C structures for application programs.



[Figure 3 labels (partly illegible): IF.h.anno; IF_xdrd.o; IF_xdrc.o]

Figure 3 ILP extension of the INRIA Stub Compiler



3.2 ILP marshalling procedures 



ISC/ILP generates the files IF_xdrc.c and IF_xdrd.c containing the (un)marshalling functions.
The data_manipulations_send macro pattern and the data_manipulations_receive macro
pattern are inserted into the marshalling and unmarshalling routines, respectively. The
application programmer has to provide an include file IF_ilp.h that defines the appropriate
data manipulation macros, which have to be executed for marshalling and unmarshalling the
data types described in the annotated C file IF.h.anno. Section 4 describes a tool that can be
used to generate the macro patterns automatically.

The interfaces of the data manipulation macros for sending and receiving are shown in the
following code example of the files IF_xdrc.c and IF_xdrd.c, which are generated by ISC/ILP.
The parameter intreg is the 4-byte register to which the marshalled data are written, while
write_ptr and read_ptr are pointers to the final memory address of the data. The parameter
ilp_var is a variable for internal data manipulation purposes; its data structure must be
defined within the file IF_ilp.h. The ilp_var_type structure may contain components
required for the various data manipulations, e.g. an intermediate variable for storing the result
of the checksum calculation.

The ILP macros are embedded within #ifdef - #endif statements in the files IF_xdrc.c and
IF_xdrd.c. In the marshalling functions, a macro is inserted for each 4-byte word to be written
to the destination memory. The marshalling output is written into a register and all the data
manipulations on this register are performed before it is written to the final memory. The
following code shows the (un)marshalling routines for the data type DATATYPE.

In the unmarshalling functions, the macros are inserted before each 4-byte word is
unmarshalled. Whether the inlined ILP data manipulations are included or not is selected by
compiling the IF_xdrc.c and IF_xdrd.c files with the corresponding compilation directive ILP.

IF_xdrc.c:

int *DATATYPE_xdrc(ilp_var, write_ptr, type_ptr)
ILP_VAR_TYPE *ilp_var;
int *write_ptr;
DATATYPE *type_ptr;
{
#ifdef ILP
    int intreg;
#endif

#ifdef ILP
    intreg = ...; /* marshal the next 4 bytes into the register */
    /* perform the ILP data manipulations and write the data */
    /* to the memory location given by write_ptr */
    data_manipulations_send(intreg, write_ptr, ilp_var);
    write_ptr++; /* increment the write_ptr */
#else
    *write_ptr++ = ...; /* write marshalling output directly to memory */
#endif
    return (write_ptr);
}

IF_xdrd.c:

int *DATATYPE_xdrd(ilp_var, read_ptr, end_ptr, type_ptr)
ILP_VAR_TYPE *ilp_var;
int *read_ptr, *end_ptr;
DATATYPE *type_ptr;
{
#ifdef ILP
    int intreg;
#endif

#ifdef ILP
    /* read received data, perform ILP data manipulations */
    /* and store the result in the register */
    data_manipulations_receive(read_ptr, intreg, ilp_var);
    read_ptr++; /* increment read_ptr */
    ... = intreg; /* unmarshalling */
#else
    ... = *read_ptr++; /* read and unmarshal directly */
#endif
    return (read_ptr);
}



The application programming interface (API) of the encoding and decoding functions can
also be seen in the code example. The first parameter of
both functions is a pointer to a variable (ilp_var) for internal data manipulation purposes. In
addition, the marshalling function requires pointers to an output buffer (write_ptr) and to the
variable to be marshalled (type_ptr). It returns a pointer to the end of the output buffer after
marshalling. The unmarshalling function needs similar parameters: a pointer to the input
buffer (read_ptr), a pointer to the end of the input buffer (end_ptr), and a pointer to a
variable for the unmarshalled data (type_ptr). The function again returns a pointer to the end
of the unmarshalled message.
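
To make the macro interface concrete, the following sketch shows what a hand-written IF_ilp.h and the corresponding routines might look like for a data type with two 4-byte fields. The contents of ILP_VAR_TYPE and DATATYPE, the additive checksum and the use of uint32_t are assumptions made for illustration. Byte-order conversion to XDR is omitted, and the ILP branch is written out unconditionally instead of under #ifdef ILP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical content of IF_ilp.h: a single checksum, unit size 4 bytes. */
typedef struct { uint32_t xsum; } ILP_VAR_TYPE;

#define data_manipulations_send(intreg, write_ptr, ilp_var) \
    do { (ilp_var)->xsum += (intreg); *(write_ptr) = (intreg); } while (0)

#define data_manipulations_receive(read_ptr, intreg, ilp_var) \
    do { (intreg) = *(read_ptr); (ilp_var)->xsum += (intreg); } while (0)

/* Hypothetical user data type with two 4-byte fields. */
typedef struct { uint32_t id; uint32_t value; } DATATYPE;

/* Marshalling in the style of IF_xdrc.c; returns the end of the output. */
uint32_t *DATATYPE_xdrc(ILP_VAR_TYPE *ilp_var, uint32_t *write_ptr,
                        const DATATYPE *type_ptr)
{
    uint32_t intreg;

    intreg = type_ptr->id;                 /* marshal field into register */
    data_manipulations_send(intreg, write_ptr, ilp_var);
    write_ptr++;
    intreg = type_ptr->value;
    data_manipulations_send(intreg, write_ptr, ilp_var);
    write_ptr++;
    return write_ptr;
}

/* Unmarshalling in the style of IF_xdrd.c; returns the end of the input read. */
const uint32_t *DATATYPE_xdrd(ILP_VAR_TYPE *ilp_var, const uint32_t *read_ptr,
                              const uint32_t *end_ptr, DATATYPE *type_ptr)
{
    uint32_t intreg;

    assert(end_ptr - read_ptr >= 2);       /* two words must be available */
    data_manipulations_receive(read_ptr, intreg, ilp_var);
    read_ptr++;
    type_ptr->id = intreg;                 /* unmarshal from register */
    data_manipulations_receive(read_ptr, intreg, ilp_var);
    read_ptr++;
    type_ptr->value = intreg;
    return read_ptr;
}
```

After a round trip, both sides hold the same checksum in their ILP_VAR_TYPE variable, which is the kind of intermediate result the text mentions.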



4 ILP LOOP GENERATION 

The purpose of this tool is to combine the various data manipulation functions into a single
ILP loop that is provided to the modified ISC/ILP stub compiler by the macro library
ilp.h in Figure 3. ILP is only efficient if the data manipulations are inlined and function calls
are avoided (Braun, 1995). Using function calls would cause N function calls for each
processing unit, with N as the number of different data manipulations. Therefore, it is
necessary to implement all data manipulations within a macro library unless it is certain that
the compiler expands all functions inline.

A major problem with generating the ILP loop is putting all the data manipulations into the
correct order and connecting them to overcome the mismatch of the different processing unit
sizes of the various data manipulations. For this task an ILP configuration tool, ilpconfig, has
been developed. Several characteristics, such as the order and the processing unit sizes of the
various data manipulations, have to be specified using a special but simple configuration
language.

4.1 ILP configuration tool 

The ILP configuration tool takes the data manipulation specification and generates the macros
data_manipulations_send and data_manipulations_receive for the send and the receive
ILP loops using a macro library (ilp.h) predefined for the different data manipulations (Figure
4). The include file generated by the ILP configuration tool is finally included by the
(un)marshalling routines. The configuration tool simplifies ILP programming further: the
user only has to describe the data structures using the annotated C language (IF.h.anno) and
the configuration of the data manipulations within the ILP loop (IF_ilp.cfg) as described in
Section 4.2.

The ILP configuration tool generates the files IF_ilp.h and common_ilp.h. The file IF_ilp.h
includes the ILP loops for sending and receiving data. The file common_ilp.h contains a
definition of a data structure for internal purposes and for variables and results required by
the different data manipulation functions. Furthermore, it contains an initialization routine for
initializing variables of the defined data structure. The configuration tool may be used for
different input files containing data descriptions in the annotated C language. In that case, the
file common_ilp.h contains the data structure that is common to all processed input files.

Figure 4 ILP Configuration Tool



4.2 Configuration language 

To specify the desired ILP loop for sending and receiving data, a simple ILP configuration lan- 
guage has been developed. This language relieves the application programmer from imple- 
menting an ILP loop for each possible combination of data manipulations. Instead, it is only 
necessary to provide a macro library (ilp.h) and a C library (ilp.c). The file ilp.h must contain 
the macros for the different data manipulations according to a certain uniform format 
described below. The file ilp.c contains mainly some initialization functions and some func- 
tions to process the result after the ILP loop. 



#define macro_name(integer1, ..., integerN, parameter1, ..., parameterN) { ... }

If the basic processing unit size of a data manipulation is N*4 bytes, N integer registers are
required in the parameter list. In addition to these registers, the parameter list may contain
additional parameters, which have to be described in the ILP description using the RESULT
and the VARIABLE keywords. Each data manipulation takes the N registers as input,
manipulates them, and replaces the initial contents with the final result.
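
Two macros written in this uniform format might look as follows. This is a sketch under stated assumptions: add_checksum, xor_mask64 and process_unit are hypothetical names, and the XOR mask merely stands in for a real 8-byte cipher. The first macro has a unit size of 4 bytes (one register), the second a unit size of 8 bytes (two registers), and each takes one additional parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Unit size 4 bytes: one register plus one RESULT-style parameter. */
#define add_checksum(integer1, sum) { (sum) += (integer1); }

/* Unit size 8 bytes: two registers plus one VARIABLE-style parameter. */
#define xor_mask64(integer1, integer2, key) \
    { (integer1) ^= (key)[0]; (integer2) ^= (key)[1]; }

/* Apply both macros to one 8-byte unit held in two local "registers". */
uint32_t process_unit(uint32_t unit[2], const uint32_t key[2])
{
    uint32_t r0, r1;
    uint32_t sum = 0;

    assert(unit != 0 && key != 0);
    r0 = unit[0];
    r1 = unit[1];
    xor_mask64(r0, r1, key);   /* registers are overwritten with the result */
    add_checksum(r0, sum);     /* two 4-byte steps cover the 8-byte unit */
    add_checksum(r1, sum);
    unit[0] = r0;
    unit[1] = r1;
    return sum;
}
```

Because every macro overwrites its registers with its own result, the output of one manipulation is directly the input of the next without passing through memory.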

A complete ILP loop configuration description consists of a set of data manipulation
functions. Each data manipulation function is described within a separate block,
independent of the other functions. Therefore, each ILP description (IF_ilp.cfg) consists of M
blocks, with M as the number of data manipulation functions of the ILP loop. The beginning
of a block is indicated by [BLOCK_NAME]. Each block itself consists of an arbitrary number
of assignments. For each data manipulation function there exists exactly one block, which
covers both the send and the receive data manipulation.

description:
    block {block}

block:
    "[" blockname "]" assignment {assignment}

assignment:
    keyword "=" value

keyword:
    "POSITION" | "NEXT" | "MACRONAME_S" | "MACRONAME_R" | "LENGTH" | "VARIABLE" |
    "RESULT" | "FINAL_S" | "FINAL_R" | "INIT"

There are several assignments for each block. Some of them are mandatory and must appear in
each block (M). Others are optional, but must appear with a certain value exactly once in a
description file (O^) or at most once in a block (O). Finally, there are optional keywords, which
may appear arbitrarily often within a block (A). The allowed keywords are shown in Table 1.
Several data manipulations require other parameters for initialization (e.g., key tables for
encryption) or results (e.g., checksum calculation). To specify additional parameters, the
VARIABLE and RESULT assignments must be used. The assignment

VARIABLE=(t_name, v_name, i_value)

means that a parameter v_name of type t_name is required. The variable may be initialized
with the optional value i_value, which is a constant value. The syntax for the RESULT
assignment is quite similar:

RESULT=(t_name, v_name, i_value)

The data types for VARIABLE and RESULT variables are currently limited to simple types such
as scalar types or pointers to scalar types. The difference between VARIABLE and RESULT
variables is that RESULT variables are used to build an advanced protocol architecture for ILP
processing as explained in Section 5. RESULT variables may be processed after the ILP loop by
special functions. Note that the following definitions are also only necessary to support the
advanced ILP protocol architecture as described in Section 5. The special functions are defined
using the FINAL_S and FINAL_R assignments according to the format

FINAL_S=(function1(res_list1), ..., functionN(res_listN)) or

FINAL_R=(function1(res_list1), ..., functionN(res_listN)).

A res_list is a list of an arbitrary number of RESULT variables, which must be defined in the
corresponding block. The special functions are processed after the ILP loop and must be
implemented in a separate library to be linked to the application program. The input
parameters of these functions must match the sequence of their definition in the ILP
configuration description. However, the functions expect a pointer to a value of the specified
variable type. The last assignment is the INIT assignment:

INIT=initfunction

The function initfunction shall contain initializations of the internal variables to be
used. This function also allows initializations that must be performed only once, e.g. of
encryption tables at the beginning of an application program. In contrast to the initializations
of the RESULT and VARIABLE variables, which are performed for each ILP loop (e.g., for
each packet), the initializations described in initfunction are performed only once within
an application.

Table 1 Configuration language

keyword       M/O/A   remark

POSITION      O^      Describes whether the data manipulation is on the highest level
                      (POSITION=TOP) or on the lowest level (POSITION=BOTTOM).
                      Only the values TOP and BOTTOM are allowed. There must exist
                      exactly one assignment POSITION=TOP and exactly one assignment
                      POSITION=BOTTOM within a description file.

NEXT          M       Pointer to the next lower data manipulation function on the
                      send side.

MACRONAME_S   M       Macro name for the data manipulation in the send ILP loop.

MACRONAME_R   M       Macro name for the data manipulation in the receive ILP loop.

LENGTH        M       Length of the basic processing unit of the data manipulation
                      function. The input processing unit size must equal the output
                      processing unit size. The case of different input and output
                      length parameters is currently not supported.

VARIABLE      A       Parameter for initializations.

RESULT        A       Parameter for results.

FINAL_S       O       Function for final result processing on the send side.

FINAL_R       O       Function for final result processing on the receive side.

INIT          O       One-time initialization routine.


4.3 Example 

The following example illustrates the use of the ILP configuration tool. The example
shows the description of an ILP loop where SAFER K-64 encryption (Massey, 1993) and TCP
checksum calculation are integrated into the (un)marshalling routines. TCP checksum
calculation is the lowest data manipulation of the ILP loop.

There are two data manipulations, namely CHECKSUM and ENCRYPTION. CHECKSUM uses a
RESULT variable with the name xsum of type u_int, which is initialized with 0 at the beginning
of each ILP loop. For each ILP loop the function TCP_xsum_final may perform final
processing steps on the RESULT variable.

The CHECKSUM data manipulation is on the lowest level of the ILP loop (BOTTOM), while
ENCRYPTION is on the highest level of the ILP loop (TOP). The top level function is performed
first for sending and last for receiving, while the bottom level function is performed first for
receiving but last for sending.

The macros being used for the ILP data manipulations (writereg, readreg, tcp_checksum,
saferk64_encrypt, saferk64_decrypt) must be defined in the macro library ilp.h,
while the file ilp.c is used to define functions for initialization and final result processing.
writereg writes manipulated data to be sent from the registers to the destination memory,
while readreg reads received data into the registers for receive data manipulations. The ILP
configuration tool generates the files common_ilp.h and IF_ilp.h from these predefined files.
The file common_ilp.h contains the definition of the internal ILP variable type ilp_var_type
and an initialization procedure required before each ILP loop.

IF_ilp.cfg:

[CHECKSUM]
MACRONAME_S=tcp_checksum
MACRONAME_R=tcp_checksum
LENGTH=4
RESULT=(u_int, xsum, 0)
POSITION=BOTTOM
FINAL_S=(TCP_xsum_final(xsum))
FINAL_R=(TCP_xsum_final(xsum))

[ENCRYPTION]
MACRONAME_S=saferk64_encrypt
MACRONAME_R=saferk64_decrypt
NEXT=CHECKSUM
LENGTH=8
POSITION=TOP
VARIABLE=(char*, key)
VARIABLE=(u_int*, logtab)
VARIABLE=(u_int*, exptab)
INIT=saferk64_ilp_init

The file IF_ilp.h contains the macros for ILP processing to be included into the
(un)marshalling routines. The data manipulation macro for sending collects the data in 4-byte
steps (the basic unit size of the XDR marshalling routine) and performs the data manipulations
for each L bytes (L: lowest common multiple (LCM) of the unit sizes of the data manipulations).
The receive data manipulation macro takes L bytes and delivers the data to unmarshalling in
4-byte steps. The code example below shows the output of the ILP configuration tool for the
example input file IF_ilp.cfg.

IF_ilp.h:

#include "ilp.h"
#include "common_ilp.h"

/* ILP_LOOP_SEND */
#define data_manipulations_send(intreg, dst, ilp_ptr) \
    switch ((ilp_ptr)->counter) { \
    case 0: \
        (ilp_ptr)->buffer0 = (intreg); \
        (ilp_ptr)->address0 = (dst); \
        (ilp_ptr)->counter++; \
        break; \
    case 1: \
        (ilp_ptr)->counter = 0; \
        saferk64_encrypt((ilp_ptr)->buffer0, (intreg), (ilp_ptr)->key, \
                         (ilp_ptr)->logtab, (ilp_ptr)->exptab); \
        tcp_checksum((ilp_ptr)->buffer0, (ilp_ptr)->xsum); \
        tcp_checksum((intreg), (ilp_ptr)->xsum); \
        writereg((ilp_ptr)->buffer0, (ilp_ptr)->address0); \
        writereg((intreg), (dst)); \
        break; \
    }

/* ILP_LOOP_RECEIVE */
#define data_manipulations_receive(src, intreg, ilp_ptr) \
    switch ((ilp_ptr)->counter) { \
    case 0: \
        (ilp_ptr)->counter++; \
        readreg((src), (intreg)); \
        readreg(((int *)(src)) + 1, (ilp_ptr)->buffer0); \
        tcp_checksum((intreg), (ilp_ptr)->xsum); \
        tcp_checksum((ilp_ptr)->buffer0, (ilp_ptr)->xsum); \
        saferk64_decrypt((intreg), (ilp_ptr)->buffer0, (ilp_ptr)->key, \
                         (ilp_ptr)->logtab, (ilp_ptr)->exptab); \
        break; \
    case 1: \
        (intreg) = (ilp_ptr)->buffer0; \
        (ilp_ptr)->counter = 0; \
        break; \
    }
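
The counter-based buffering in the listing above can be tried out with the following self-contained sketch. It mirrors the structure of the generated macros but is not the tool's output: the state type ilp_state, the functions ilp_send_word and ilp_recv_word, and the toy XOR cipher and additive checksum are assumptions chosen so that the example runs without ilp.h. Functions are used instead of macros only for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state, modelled on the fields used by the generated macros. */
typedef struct {
    int       counter;   /* words collected for the current 8-byte unit */
    uint32_t  buffer0;   /* buffered word of the unit */
    uint32_t *address0;  /* destination of the buffered word (send side) */
    uint32_t  xsum;      /* running checksum (RESULT variable) */
    uint32_t  key[2];    /* toy cipher key (VARIABLE) */
} ilp_state;

/* Toy stand-ins for the 8-byte cipher and the 4-byte checksum. */
#define toy_encrypt(r0, r1, k) { (r0) ^= (k)[0]; (r1) ^= (k)[1]; }
#define toy_decrypt(r0, r1, k) toy_encrypt(r0, r1, k)
#define toy_checksum(r, sum)   { (sum) += (r); }

/* Send side: called once per marshalled 4-byte word. */
void ilp_send_word(ilp_state *s, uint32_t intreg, uint32_t *dst)
{
    assert(s->counter == 0 || s->counter == 1);
    switch (s->counter) {
    case 0:                        /* first half of the unit: buffer it */
        s->buffer0 = intreg;
        s->address0 = dst;
        s->counter++;
        break;
    case 1:                        /* unit complete: run all manipulations */
        s->counter = 0;
        toy_encrypt(s->buffer0, intreg, s->key);
        toy_checksum(s->buffer0, s->xsum);
        toy_checksum(intreg, s->xsum);
        *s->address0 = s->buffer0; /* write both words to memory */
        *dst = intreg;
        break;
    }
}

/* Receive side: returns the next 4-byte word for unmarshalling. */
uint32_t ilp_recv_word(ilp_state *s, const uint32_t *src)
{
    uint32_t intreg = 0;

    assert(s->counter == 0 || s->counter == 1);
    switch (s->counter) {
    case 0:                        /* read a whole unit, deliver word 0 */
        s->counter++;
        intreg = src[0];
        s->buffer0 = src[1];
        toy_checksum(intreg, s->xsum);
        toy_checksum(s->buffer0, s->xsum);
        toy_decrypt(intreg, s->buffer0, s->key);
        break;
    case 1:                        /* deliver the buffered word 1 */
        intreg = s->buffer0;
        s->counter = 0;
        break;
    }
    return intreg;
}
```

On both sides the checksum is computed over the encrypted words, so sender and receiver arrive at the same value, while the top-level cipher runs first for sending and last for receiving.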



5 TOOL EXTENSIONS TO SUPPORT AN ALF/ILP PROTOCOL 
ARCHITECTURE 

5.1 ALF/ILP protocol architecture 

Several guidelines for future protocol architectures supporting ILP have been proposed in 
(Braun, 1995). To implement ALF/ILP concepts efficiently, the following design properties 
are essential: 

• fixed size packet headers, 

• packet trailers for user data dependent fields (e.g., checksum values), and 

• separate packets for user data and control functions 

We have developed an ALF-based communication architecture which is based on these pro- 
posals. Application data units (ADUs) are the basic processing unit in an ALF-based commu- 
nication architecture and an ADU is the only data structure to be used for protocol processing. 
Usually, there exist different types of ADUs, e.g. ADUs for control purposes and ADUs for 
pure application data, such as graphical data to be displayed on a screen. However, the
existence of several ADU types should be transparent for ADU processing in order to allow a
single ILP loop for all kinds of ADUs. For composing and parsing ADUs to be sent or received,
the (un)marshalling routines of the INRIA stub compiler (ISC) are used. This allows an ADU
to be specified as a union of several ADU subtypes. The (un)marshalling procedure provides
full processing of such an ADU independent of its type. Only one (un)marshalling procedure is
necessary in this case for all kinds of ADUs.

The input file IF_ilp.h.anno for ISC/ILP can be provided by an application programmer or by
another tool such as the ALF Compiler Control described in (Braun, 1996). The ALF compiler
takes as input an annotated C specification of ADUs containing only user data and an
ESTEREL protocol specification, both provided by the application programmer. It generates
an annotated C specification of user data ADUs with additional control ADUs required for the
protocol functions described in the ESTEREL protocol specification.

Integrating ILP functions into the (un)marshalling procedures causes two problems: 

• Different ILP manipulations require different alignment boundaries.

If an ADU is not aligned to those boundaries, padding bytes have to be inserted. However,
the number of padding bytes required for the different functions might differ.

• Protocol data unit fields may depend on the ILP data manipulation.

Putting those fields into the header of an ADU might cause additional copy operations
within an implementation or prevent the use of ordering-constrained data manipulation
functions (Braun, 1995).



The first problem is solved by calculating a number of padding bytes that matches the
requirements of all data manipulations. The number of padding bytes is chosen so that the
length of the manipulated data, i.e. the original ADU plus the padding bytes, is a multiple of
the lowest common multiple of all length values of the ILP data manipulations used. The
second problem is solved by adding the result values, i.e. the values depending on the data
manipulations, to the end of the packet. These values are also (un)marshalled using
(un)marshalling routines generated by a stub compiler such as ISC to support heterogeneous
environments. The selected solutions result in a special data structure of an ADU appropriate
for ILP processing. The structure is shown in Figure 5.
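
The padding rule can be written down in a few lines. In this sketch the function name ilp_padding is hypothetical, and unit stands for the lowest common multiple of the LENGTH values, in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Number of padding bytes so that adu_len + padding is a multiple of unit. */
size_t ilp_padding(size_t adu_len, size_t unit)
{
    assert(unit > 0);
    return (unit - adu_len % unit) % unit;
}
```

With the LENGTH values 4 and 8 from the example in Section 4.3, unit is 8 bytes, so a marshalled ADU of 20 bytes receives 4 padding bytes and one of 16 bytes receives none.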




[Figure 5 labels: manipulated data; data not manipulated]

Figure 5 ADU structure for ILP



5.2 Tool extensions and Application Programming Interface 

To support the advanced ALF protocol architecture, the ILP configuration tool generates some 
additional files for ILP processing of an ADU. First, a file ''ILP_RES.h.anno" is generated, 
which serves as an input file of a second run of the ISC/ILP stub compiler, in which (un)mar- 
shalling routines for the ILP results without additional ILP data manipulations are generated. 
Furthermore, the ILP configuration tool generates functions for sending and receiving ADUs. 

The functions for sending performs the following steps: 

• marshalling and ILP data manipulation as defined for the ADU 

• adding padding bytes to the marshalled and manipulated ADU 

• adding the ILP data manipulation results to the end of the marshalled and manipulated 
message using a marshalling function without ILP data manipulations 

The function for receiving performs: 

• unmarshalling and data manipulations of the ADU and the padding information 

• unmarshalling of the data manipulation results 

• comparison of the unmarshalled data manipulation results with the results calculated at the 
receiver 

In the last step, it is checked whether the received data was completely unmarshalled, which 
must be the case in an ALF architecture. Furthermore, the lowest data manipulation function 
for which the results at the sender and at the receiver differ is identified in order to 
detect errors. The configuration tool generates a file IFJlp.c, which contains the C code of 
the functions for sending and receiving ADUs. 

Figure 6 ILP configuration tool extensions for ILP processing of ADUs 



The interfaces of these functions for sending and receiving data look as follows: 



int *ilp_send_DATATYPE(ilpresult, buffer, DATATYPE_ptr) 
ILP_RES *ilpresult; 
int *buffer; 
DATATYPE *DATATYPE_ptr; 

The first parameter (ilpresult) is a pointer to the generated type for storing the ILP data 
manipulation results. This variable may be initialized or evaluated by the application program. 
The second parameter (buffer) is a pointer to the output buffer, where the marshalled and 
manipulated data must be written. The last parameter (DATATYPE_ptr) is a pointer to the data 
type describing an ADU. The function returns a pointer to the end of the buffer, to which the 
data has been written during marshalling and data manipulations. The function for receiving 
data has the following interface: 



int ilp_receive_DATATYPE(ilpresult_r, ilpresult_s, buffer, bufferend, 
                         DATATYPE_ptr) 
ILP_RES *ilpresult_r; 
ILP_RES *ilpresult_s; 
int *buffer; 
int *bufferend; 
DATATYPE *DATATYPE_ptr; 

The first two parameters contain the data manipulation results. The first one (ilpresult_r) 
contains the results calculated by the receiver during unmarshalling and data manipulation. 
The second one (ilpresult_s) contains the results calculated at the sender. If one or more 
data manipulation results differ, an error value is returned that identifies the lowest level 
at which the results differ. The next two parameters describe the beginning (buffer) and the 
end (bufferend) of the received data, and the last parameter is the pointer to the 
unmarshalled variable (DATATYPE_ptr). The function returns an error value as described above. 
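The comparison step of the receive function might look roughly like the following sketch. The layout of ILP_RES shown here is a hypothetical stand-in (the real type is generated by the tool and is not given in the text), as are the constant ILP_LEVELS, the function name ilp_compare_results and the convention that 0 means "no error".

```c
#include <assert.h>
#include <stddef.h>

#define ILP_LEVELS 3   /* hypothetical: number of data manipulations */

/* Hypothetical stand-in for the generated ILP_RES type: one result value
 * (e.g. a checksum) per data manipulation, lowest level first. */
typedef struct {
    unsigned long result[ILP_LEVELS];
} ILP_RES;

/* Returns 0 if all results match; otherwise returns the 1-based index of
 * the lowest data manipulation whose sender and receiver results differ. */
int ilp_compare_results(const ILP_RES *recv, const ILP_RES *sent)
{
    assert(recv != NULL && sent != NULL);
    for (int level = 0; level < ILP_LEVELS; level++) {
        if (recv->result[level] != sent->result[level])
            return level + 1;
    }
    return 0;
}
```

Scanning from the lowest level upwards means the first mismatch found is the one reported, which matches the behaviour described for the error value.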



6 EVALUATION 

To estimate the achievable benefits of ILP, a file transfer application with an encryption 
function on top of a user-level TCP has been implemented and the performance of the 
application has been measured. The results (Braun, 1995) show that it is possible to obtain 
performance benefits by integrating marshalling, encryption and TCP checksum calculation. The 
experiments yielded a throughput gain of only 10-20%, in contrast to the 50% gain achieved for 
simple loop experiments (Clark, 1990). This makes the benefit of ILP debatable in existing 
communication systems and workstations. 

The ILP code for the experiment described in (Braun, 1995) has been written manually. The 
code generated by ISC/ILP is very similar to the manually written code. Experiments with 
automated code generation did not yield noticeable performance differences. 

This fact is not surprising, because the ILP compiler inlines hand-written macros from the 
macro library into the ILP loop. The number of read/write accesses and CPU operations is 
nearly the same in both implementations. Therefore, the same performance improvements by 
ILP can be achieved using ISC/ILP for automated ILP code generation instead of manually 
coding the ILP loops. The evaluation in (Braun, 1995) showed that the ILP benefits stem 
mainly from reduced accesses to (cache or main) memory. 



7 CONCLUSIONS 

As discussed in (Braun, 1995), the ILP performance improvements may be very small. The 
smaller the expected ILP benefit, the smaller the effort a programmer is willing to spend on 
ILP. A tool for automated ILP loop generation decreases the effort required for ILP 
implementations. Therefore, ILP implementations become more acceptable in terms of the 
cost/benefit trade-off if tools for automated coding help to minimize implementation costs. 

The fact that the automatically generated code does not perform worse than manually written 
code increases the attractiveness of automatic code generation tools for communication 
software development. 

The advantages of such a tool are obvious. The time to develop efficient protocol code can be 
kept small. Automatically generated code contains far fewer faults, which reduces the effort 
needed to test and debug the code. Replacing data manipulations by other ones only requires 
running the ILP code generation tool and re-compiling the resulting (un)marshalling procedures. 



ABSTRACT 

The long awaited 'new environment' of high speed broadband networks and multimedia 
applications is fast becoming a reality. However, few systems in existence today, whether they 
be large scale pilots or small scale test-beds in research labs, offer a fully integrated and flexible 
environment where multimedia applications can maximally exploit the quality of service (QoS) 
capabilities of supporting networks and end-systems. In this paper we describe the 
implementation of an adaptive transport system that incorporates a QoS oriented API and a 
range of mechanisms to assist applications in exploiting QoS and adapting to fluctuations in 
QoS. The system, which is an instantiation of the QoS Architecture (QoS-A), is implemented 
in a multi ATM switch network environment with Linux based PC end systems and continuous 
media file servers. A performance evaluation of the system configured to support a 
Video-on-Demand application scenario is presented and discussed. Emphasis is placed on novel 
features of the system and on their integration into a complete prototype. The most prominent 
novelty of our design is a 'distributed QoS adaptation' scheme which allows applications to 
delegate to the system responsibility for augmenting and reducing the perceptual quality of 
video and audio flows when resource availability increases or decreases. 

1. Introduction 

In this paper we describe the design, implementation and evaluation of the Quality of Service 
Architecture (QoS-A) [Campbell,93] transport system, called the multimedia enhanced transport 
system (METS). METS supports multi-layer coded flows in a multicast, multimedia networking 
environment in which client workstations have varying capabilities. In terms of services, METS 
offers a flexible quality of service (QoS) configurable API at the transport layer. In terms of 
mechanisms, it populates the network layer as well as the transport layer with a number of modules 
providing control over QoS. 

The novel aspects of METS relate to the protocol, QoS maintenance and flow management 
planes of the QoS-A as illustrated in Figure 1. Briefly, the protocol plane is responsible for 
transferring media with a target level of QoS. The QoS maintenance plane is then responsible for 
the fine grained monitoring and maintenance of the protocol plane. For example, at the transport 
layer, the QoS maintenance plane monitors rate, loss, jitter and delay and takes remedial action 
when they fluctuate. Finally, the flow management plane is responsible for flow establishment 
(including end-to-end admission control, QoS based routing and resource reservation), QoS 
mapping, which translates QoS representations between layers, and coarse grained QoS 
management (e.g., re-negotiation of QoS). 

The paper describes flow scheduling, flow shaping and basic flow monitoring in the QoS-A 
protocol plane. Flow scheduling and shaping are fundamental to the smooth pacing of media onto 
the network and regulation of media between end-systems. Flow monitoring also plays an 
important role in measuring the performance of the flow as media is delivered to the receiver. In the 
QoS maintenance plane, the most important functions are transport QoS management and jitter 
correction, which works in unison with the flow monitor to smooth out network induced jitter. In 
the flow management plane, METS provides QoS groups which encapsulate multicast sessions in 
which participants with heterogeneous QoS capabilities/requirements may participate. The flow 
management plane arranges for per-participant QoS negotiation and resource allocation to take 
place and is also responsible for informing the user of ongoing QoS performance. 




Figure 1: QoS-A Model 

The other key aspect of METS described in this paper is dynamic QoS adaptation. This is a 
flow management plane mechanism designed to exploit the layered encoding property of currently 
popular media formats. An example of a media format with layered encoding is MPEG 
[H.262,94]. MPEG structures video streams in terms of three layers: a coarse or base 
representation of the signal plus successive enhancement layers. In the case of MPEG1, the base 
layer (BL) is represented by I pictures and the enhancement layers (E1 and E2) by P and B 
pictures, respectively. In our dynamic QoS adaptation scheme, QoS adapters take remedial action, 
based on a user supplied QoS policy, to scale flows (e.g. by adding or removing enhancement 
layers or instantiating filters) when resource availability and/or user QoS requirements change. 
Scaling is a term, first proposed by [Delgrossi,93], used to refer to the dynamic manipulation of 
media flows by objects called filters as they pass through a communications channel. Example 
MPEG filters are coarse grained picture droppers and fine grained low-pass filters (which trade off 
bandwidth for picture resolution) [Yeadon,94]. 

The remainder of this paper is structured as follows. We first present, in section 2, a detailed 
description of the METS API and some of its key internal mechanisms. Then, in section 3, we 
present a performance evaluation of our METS implementation which throws light on the 
feasibility of the proposed mechanisms and identifies bottlenecks and pointers for further 
optimisation. We present our conclusions in section 4. 

2. QoS-A Adaptive Transport System 

2.1 Transport Application Programmer’s Interface 

The METS API is realised as a set of extensions to the Berkeley socket API¹. QoS is specified 
at the API in terms of a flow specification and a QoS policy. The flow specification includes 
parameters such as delay, throughput, jitter etc. The QoS policy allows the user to advise the 
infrastructure on how to deal with the flow when resource availability changes. For example, the 
QoS policy may require that the system reduces QoS when resources are in short supply (perhaps 
by frame dropping [Yeadon,94] or shaping filters [Eleftheriadis,95]), or it may simply require that 
the user be informed of any degradations. A QoS policy may also require that QoS be raised when 
resources become available; e.g., by adding an enhancement layer to a hierarchically structured 
video flow [Shacham,92] [Kanakia,93]. 

The API assumes a client-server model in which servers (i.e. applications offering layer 
encoded media flows to potential clients) create QoS groups with a given flow specification and 
QoS policy. After a group has been created and advertised, interested clients may join QoS groups 
by determining the flow specification and QoS policy, and selecting an appropriate (set of) 
component part(s) (viz. BL, E1, E2) of the flow. They are then free to join the group (in effect, 
establish a connection to an underlying multicast SVC) via a set of connection management 
primitives. In addition to requesting information about the server's QoS profile, clients and servers 
may also retrieve detailed statistics about membership of a particular group. 

The API provides three types of socket: i) media sockets used for the transfer of continuous 
media; these are simplex and QoS configurable via a flow specification and QoS policy; ii) control 
sockets used for the transfer of application level control information; these are full duplex and 
assured (i.e. they provide a reliable delivery service); and iii) management sockets used to interface 
with the QoS maintenance and flow management planes. 

The API primitives are structured into the following categories: 

• group management primitives, which allow the user to open a multicast group, get group 
information and gracefully close a group; 

• connection management primitives, which allow both clients and servers to join and leave 
QoS groups; and 

• flow management primitives, which allow both clients and servers to perform ongoing 
management and monitoring of flows in which they are participating. 

The group management and connection management primitives are conceptually 
straightforward; it is the flow management primitives (see Figure 2) which provide most of the 
QoS support. These allow servers and clients to register flows, to change the QoS of flows and to 
receive QoS signals associated with a particular flow. Whenever clients and servers create media 
sockets they register them using the registerSoc primitive. This allows the underlying flow 
manager to interact with the application over the associated management socket to provide 
monitoring and maintenance information about the on-going flow. 



¹ The API is based on a new protocol family called AF_METS. By preserving compatibility with the current 
Berkeley socket API, existing applications (e.g., those using AF_INET) can run unchanged or can easily be 
modified to take advantage of underlying QoS support; see section 3 for more details of the implementation 
environment. 

Flow Management Primitives    Parameters 

registerSoc                   management sock, media sock 

changeQoS Req                 management sock, 
changeQoS Ack                 options (flowSpec | QoSPolicy | maint | monitor | 
changeQoS Nak                          signal | adapt | filter | event), 
                              structure (flowSpec | QoSPolicy | maint | monitor | 
                                         signal | adapt | filter | event), 
                              sizeof (flowSpec | QoSPolicy | maint | monitor | 
                                      signal | adapt | filter | event), 
                              status 

signalQoS                     type (signal | event), 
                              option (QoSMetric | QoSEvent) 

Figure 2: Flow Management Primitives 



At any point during a session, group members may change the QoS negotiated during the 
connection establishment phase using the changeQoS primitive. The options accepted by this 
primitive are as follows: 

• FlowSpec is used to submit a new flow specification to renegotiate QoS¹; 

• QoSPolicy is used to submit a new QoS policy; 

• Maint, Monitor and Signal are used to select QoS maintenance options: in Maint mode the 
transport QoS manager actively maintains² the flow; in Monitor mode it maintains the flow 
and also forwards periodic QoS ‘signals’ (via signalQoS) to the user via the management 
socket; in Signal mode it does not maintain the flow but forwards QoS signals anyway; 

• Adapt is used to change the adaptation mode (viz. discrete or continuous). Discrete mode 
refers to the addition/removal of enhancement layers whereas continuous mode involves 
the instantiation of, for example, source bit rate filters [Campbell,96a], which permit fully 
continuous adaptation of bit rates; 

• Filter is used to explicitly select new filters (e.g. jitter filter at the receiver or picture 
dropping and low pass filters [Yeadon,96] in the network); and 

• Event is used to allow applications to attach alarms to the occurrence of particular event 
thresholds. Should the threshold be exceeded then a message is asynchronously sent to the 
interested party via a signalQoS primitive on the management socket. 

Note that while a change in QoS initiated by a client only affects the local client’s QoS, a 
change by the server may impact all active clients in the current session. 

2.2 METS Mechanisms 

We now describe key aspects of the implementation of the METS transport system. This 
section focuses on the mechanisms used to support QoS maintenance and flow management. 
Mechanisms in the end-system and in the network are described in separate sub-sections. 



¹ We do not exhaustively specify the various fields in a flow specification or QoS policy in this paper; these 
details are available in [Campbell,96a]. 

² This means that it instantiates mechanisms such as those detailed in section 2.2 to attempt to deliver the 
required QoS in the face of QoS fluctuations in the underlying network/end-systems. 

The transport system comprises four implementation modules which map closely to the 
QoS-A control plane, user plane, QoS maintenance plane and flow management plane, 
respectively. As illustrated in Figure 3, within each of these per-plane modules METS provides a 
range of QoS mechanisms. Rather than describe the full range of mechanisms (these are all detailed 
in [Campbell,96a]), we concentrate here on those mechanisms (highlighted in Figure 3) in the data 
plane which are responsible for flow scheduling and shaping, those in the QoS maintenance plane 
responsible for controlling jitter and those in the flow management plane responsible for coarse 
grained adaptation to changes in QoS by means of adding/removing enhancement layers and/or 
instantiating filters. These are the mechanisms that are the main focus of the performance 
evaluation in section 3. 

Figure 3: METS Transport System 

2.2.1 Transport QoS Mechanisms 

2.2.1.1 Flow Scheduler 

The flow scheduler operates in conjunction with the flow shaper (see Figure 4 below; also see 
Figure 3) to provide appropriate rate control to ensure per flow bandwidth guarantees and help in 
jitter management. 

The role of the flow scheduler is to schedule application level frames (ALF) [Clark,90] on a 
coarse grained frames-per-second basis. It is implemented in a user space library and is used in 
both transmission and reception. In transmission, the scheduler uses the standard Linux clock 
timer to dispatch application level frames to the flow shaper stage according to deadlines derived 
from the flow specification. A variable bit rate service is provided by isochronously scheduling 
variable sized packets to the lower layers. At the receiver, the flow scheduler relies on the jitter 
filter (described below) to provide the scheduling deadlines¹. 

¹ In this role, the jitter filter adjusts the deadline of delivered frames but not the rate. 

A problem with implementing the flow scheduler in the Linux environment is that it relies on 
the coarse grained Linux clock which can result in slippage or drift. To alleviate this problem a drift 
compensation function [Simpson,95] is used which takes into account any missed deadlines. If a 
deadline has been missed then the flow scheduler immediately allows the application to transmit or 
receive media. The duration of the next scheduling opportunity is then calculated and takes any 
drift in the isochronous rate of the transmitter into account. The flow scheduler keeps track of any 
missed deadlines and informs a module in the flow management plane should the number of 
missed deadlines exceed a pre-defined missed deadline threshold. As described in section 2.1, 
flow management can upcall applications (via the management socket) to inform them of such 
events. 
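The following C sketch shows one common way to compensate for timer drift: deadlines are derived from the start time and the frame count rather than from the previous wake-up time, so lateness does not accumulate. This is an assumption made for illustration; the actual drift compensation function of [Simpson,95] is not specified here, and the type flow_sched and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-flow scheduler state; all times are in microseconds. */
typedef struct {
    long long start_us;   /* time of the first scheduling opportunity  */
    long long period_us;  /* frame period derived from the flow spec   */
    long long frames;     /* number of frames dispatched so far        */
    int missed;           /* number of missed deadlines                */
} flow_sched;

void flow_sched_init(flow_sched *s, long long start_us, long long period_us)
{
    assert(s != NULL && period_us > 0);
    s->start_us = start_us;
    s->period_us = period_us;
    s->frames = 0;
    s->missed = 0;
}

/* Called when a frame is dispatched at time now_us. Counts a missed
 * deadline if the dispatch was late and returns the time to wait until
 * the next scheduling opportunity (0 means: dispatch immediately). */
long long flow_sched_next_wait(flow_sched *s, long long now_us)
{
    long long deadline, next, wait;
    assert(s != NULL);
    deadline = s->start_us + s->frames * s->period_us;
    if (now_us > deadline)
        s->missed++;
    s->frames++;
    /* The next deadline is anchored to the start time, so the wait is
     * shortened by however late the current dispatch was. */
    next = s->start_us + s->frames * s->period_us;
    wait = next - now_us;
    return wait > 0 ? wait : 0;
}
```

With a 40 ms period, a frame dispatched 5 ms late would be followed by a 35 ms wait. A caller could compare the missed counter against a threshold to decide when to notify flow management.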

When layered-encoded flows are being used, QoS-A applications are generally designed to 
only transmit base layer frames when deadlines are being missed consistently. When congestion 
has cleared, flow management informs the application via a QoS event signal to resume the original 
rate. Of course, the flow scheduler does not understand the semantics of layered flows. 

Figure 4: Flow Scheduling and Shaping 

2.2.1.2 Flow Shaper 

The flow shaper provides open loop flow control based on a token bucket scheme that paces 
cells to the network interface. It is implemented as part of the kernel-level ATM device driver and is 
invoked every 1 ms using a dedicated hardware timer. The token bucket scheme is a variant of the 
leaky bucket algorithm. In this scheme, flows accumulate credits which represent the number of 
ATM cells that can be transmitted to the network over the next interval. 

The flow shaper maintains the following per flow state: i) a token budget, b, which represents 
the capacity or depth of the token bucket; ii) a token credit, r, which represents the remaining 
credits (0 ≤ r ≤ b) left in the token bucket at any point in an interval; iii) a token refresh rate, p, 
which represents the rate at which the token bucket is refilled; and iv) a token timeout, t, which 
represents the number of ticks remaining until the token refresh rate expires and token credits are 
refreshed. 

The token bucket scheme operates in a rather simple fashion. When the token timeout expires 
the token credit is set to be equal to the token budget. One credit represents one cell transmission 
opportunity. When the flow shaper executes, it visits all the per flow queues in a round robin 
fashion and, if the queue contains cells and has available credits, then the shaper transmits one cell 
from the queue and decrements the token credit state variable. This achieves the desired effect of 
interleaving cells from different SVCs onto the network. 

There are two possible outcomes at the end of a token refresh interval: either all the cells have 
been drained or cells remain in the queue. If cells remain in the queue then all the credits have been 
used during the interval. If no cells remain queued then all queued cells have been dispatched to 
the network during the interval with (0 ≤ r < b) credits potentially remaining. Any credits 
remaining at the end of a token refresh cycle are lost. 
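The token bucket operation described above can be sketched in C as follows. This is a simplified user-space model, not the kernel-level ATM driver code: the structure shaper_flow and the function shaper_tick are hypothetical, the refresh rate p is modelled as a refresh period in ticks (an assumption), and queues are reduced to a count of waiting cells.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-flow shaper state, following the names in the text. */
typedef struct {
    unsigned budget;   /* b: depth of the token bucket                 */
    unsigned credit;   /* r: credits left in the current interval      */
    unsigned refresh;  /* p: modelled here as refresh period in ticks  */
    unsigned timeout;  /* t: ticks remaining until the next refresh    */
    unsigned queued;   /* number of cells waiting in the flow's queue  */
} shaper_flow;

/* One shaper invocation (one timer tick). Visits all flows in round robin
 * order, sends at most one cell per flow and returns the cells sent. */
unsigned shaper_tick(shaper_flow *flows, size_t n)
{
    unsigned sent = 0;
    assert(flows != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        shaper_flow *f = &flows[i];
        assert(f->refresh > 0);
        /* Timeout expired: unused credits are lost, credit is reset to b. */
        if (f->timeout == 0) {
            f->credit = f->budget;
            f->timeout = f->refresh;
        }
        /* One credit represents one cell transmission opportunity. */
        if (f->queued > 0 && f->credit > 0) {
            f->queued--;
            f->credit--;
            sent++;
        }
        f->timeout--;
    }
    return sent;
}
```

Because each visit sends at most one cell per flow, cells from different flows are interleaved, and a flow can never send more than b cells per refresh interval.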

2.2.1.3 Jitter Filter 

The objective of the jitter filter is to restore the timing properties of the originally transmitted 
data flow at the source before the flow is delivered to the play out device at the sink (i.e. to remove 
end-system and network induced jitter). 

It is a straightforward process to recover the original timing in communication systems that 
provide a hard bound on delay (it is only required to buffer packets at the sink until time T+D, 
where T is a timestamp placed in the packet at the source and D is the bounded maximum 
end-to-end delay). Many networks, however, are unable to guarantee a hard maximum delay bound; in 
such cases the receiver must form a continuously updated estimate of the maximum delay as the 
basis on which to calculate buffering time. In the algorithm we have adopted (based on work by 
[Jacobson,88] [Ramjee,94] and using synchronised clocks [Mills,92]) statistical analysis of 
per-packet delay is used to estimate the maximum delay; i.e. $D = d + r \cdot s$, where d is the average delay 
(see below), s is the standard deviation (see below) and r is a filter coefficient which is chosen as a 
function of the form of the distribution curve and the number of failures that one is ready to accept 
[Huitema,95]. 

Each failure corresponds to a transmission delay larger than the estimated maximum. Packets 
that arrive after D are considered to be too late to help reconstruct the signal and, in this case, flow 
management is informed of a late packet event and the packet is dropped. A popular value for r is 
2, which corresponds to an accepted loss of about 1% if one assumes a Gaussian distribution of 
the delays [Huitema,95]. From an intuitive point of view the 2s term is used to set the playout time 
to be "far enough" beyond the delay estimate so that only an acceptable fraction of the arriving 
packets should be lost due to late arrival [Ramjee,94]. For a full discussion of these issues see 
[Jacobson,88]. 

The jitter filter continuously estimates the average delay and standard deviation. It is based on 
the low pass filtering algorithm used in TCP for the estimation of the acknowledgement delay 
time [Jacobson,88] and operates as follows. When a METS packet arrives, the transmission delay, 
t, is determined as the difference between the received time and the emission time stamp. The 
average delay and standard deviation are then calculated as follows: $d = d_{old} + a(t - d)$, 
$s = s_{old} + b(|t - d| - s)$. The constants a and b (a, b < 1) are smoothing coefficients. Typical 
values are 1/8 and 1/16, respectively, which make the calculation of d and s particularly efficient. 
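A minimal C sketch of the estimator is given below. It assumes, following the TCP round-trip estimator of [Jacobson,88], that the d and s on the right-hand sides of the update equations are the previous estimates. The type jitter_est and the function names are hypothetical, and floating point is used for clarity, whereas an efficient implementation would likely use integer shifts for the factors 1/8 and 1/16.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimator state (same time unit as the measured delays). */
typedef struct {
    double d;  /* smoothed average delay              */
    double s;  /* smoothed deviation of the delay     */
} jitter_est;

#define JITTER_A (1.0 / 8.0)   /* smoothing coefficient a */
#define JITTER_B (1.0 / 16.0)  /* smoothing coefficient b */

/* Update the estimates with the transmission delay t of a new packet. */
void jitter_update(jitter_est *e, double t)
{
    double err, mag;
    assert(e != NULL);
    err = t - e->d;                    /* deviation from the old average */
    mag = err < 0 ? -err : err;        /* |t - d|                        */
    e->d += JITTER_A * err;            /* d = d_old + a (t - d_old)      */
    e->s += JITTER_B * (mag - e->s);   /* s = s_old + b (|t - d| - s_old) */
}

/* Estimated maximum delay D = d + r * s for filter coefficient r. */
double jitter_max_delay(const jitter_est *e, double r)
{
    assert(e != NULL);
    return e->d + r * e->s;
}
```

Starting from d = 100 and s = 8, a packet with delay 116 moves the estimates to d = 102 and s = 8.5, giving D = 119 for r = 2.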

Having determined a method for calculating a continuously updated estimate of D, it is 
necessary to decide on an appropriate time granularity at which D, and thus the playout time, 
should be updated. It is clearly not desirable to continuously alter the playout time on a frame by 
frame basis. In our scheme, the playout time is only used to influence playout at the beginning of 
each MPEG ‘group of pictures’ (GOP). The objective (as in the vat audio tool [Jacobson,93]) is to 
keep any adjustments as imperceptible as possible to the human visual system. If flow management 
determines that too many losses have occurred it calculates a new playout time. When no losses are 
detected the same playout point is adhered to. If no losses occur over a number of monitoring 
periods the playout time is recalculated. In this way the jitter filter can, wherever possible, ‘pull in’ 
the playout estimate to decrease end-to-end delay. 

2.2.2 Network QoS Mechanisms 

2.2.2.1 Distributed QoS Adapter 

The distributed QoS adapter resides in the flow management plane and is responsible for 
informing source and sink applications (in a manner controlled by the QoS policy) when additional 
resources become available that can be used to support additional enhancement layers (i.e. E1 and 
E2) in layer-encoded flows. Note that in our scheme the QoS of the base layer (BL) is guaranteed. 
This is based on end-to-end admission control and resource reservation (see [Campbell,96a] for 
full details). The base layer is, therefore, independent of the QoS adaptation process and the latter 
only manages the enhancement layers. The QoS adapter may also unilaterally choose to discard an 
enhancement layer (again, dependent on the QoS policy) if resources become scarce. Instances of 
the QoS adapter are present in both end systems and network switches. 

The QoS adapter operates by periodically probing the network to test for resources that have 
just become available or unavailable. At the start of each cycle (called an era), the local QoS adapter 
instance of each receiver determines its local resource availability and sends a RES signalling 
message containing an indication of the additional bandwidth required to support the addition of 
enhancement layers. This is inspired by a similar mechanism in the RSVP protocol [RSVP,95]. In 
the case where the adaptation protocol determines that congestion¹ has been detected over the 
preceding interval it reduces its request (from E2 to E1 or E1 to 0) before sending the RES 
message. Once the QoS adaptation protocol notes that congestion has passed, that is to say, when it 
notes that no congestion indications have been received over the last interval, it increases its 
bandwidth demands accordingly. RES messages are forwarded toward the core switch 
[Ballardie,93] of the multicast SVC (see section 3.1). These messages can be updated on their way 
to the source by all intervening switches to reflect the resource availability at traversed switches. 
When the RES message arrives at the source, it indicates the advertised rate (i.e., bandwidth 
available) to the source over the next era. The advertised rate can be zero, E1 or E2 cells/second. 
Because we support the concept of end-to-end adaptivity in the QoS-A, the adaptation protocol at 
the source also takes into account the end-system resource availability. This allows the source side 
adaptation protocol to reduce the advertised rate (e.g., reducing E2 to E1 or E1 to zero) in the RES 
messages if need be. Following this, the source informs the application (via its management 
socket) should there have been a change in the advertised rate over successive intervals. If there is 
a change, the QoS adaptation protocol requests the flow shaper to reconfigure the per flow token 
bucket state (see section 2.2.1.2) based on the new advertised rate. The source then responds to the 
RES message by multicasting an ADAPT message to all receivers indicating the available 
bandwidth over the coming era. When it receives an ADAPT message indicating that it may add or 
must remove an enhancement layer, the QoS adapter at a receiver informs the application so that it 
can arrange to start dealing with a modified flow. 
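The per-era decisions at the receiver and at the source can be sketched in C as below. Several points are assumptions made purely for illustration: requests are modelled as three discrete levels, an increase is taken to be one level per congestion-free era (the text only says demands increase "accordingly"), and the names enh_level, receiver_next_request and source_advertised are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Enhancement level requested or advertised: none, E1, or E2. */
typedef enum { LAYER_NONE = 0, LAYER_E1 = 1, LAYER_E2 = 2 } enh_level;

/* Receiver side, start of an era: step the request down if congestion was
 * seen over the preceding interval, otherwise step it up towards the level
 * the client wants (assumed policy: one level per era). */
enh_level receiver_next_request(enh_level current, enh_level wanted,
                                int congested)
{
    assert(current <= wanted);
    if (congested)
        return current > LAYER_NONE ? (enh_level)(current - 1) : LAYER_NONE;
    return current < wanted ? (enh_level)(current + 1) : current;
}

/* Source side: the advertised rate carried in the RES message may be
 * reduced further to reflect end-system resource availability. */
enh_level source_advertised(enh_level res_level, enh_level local_limit)
{
    return res_level < local_limit ? res_level : local_limit;
}
```

Under these assumptions a receiver at E2 that observes congestion would request E1 in the next era, and a source that can only sustain E1 would advertise E1 even if the RES message arrives carrying E2.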

¹ If a switch detects that queues are building beyond a pre-defined threshold the switch sets the congestion bit in 
the ATM header of cells. The flow monitor detects this condition and notes it in the flow's congestion state (see 
Figure 4). At the end of an era the QoS adaptation protocol reads the congestion state and resets it to 
uncongested. See [Campbell,96a] for full details. 

The adaptation protocol is complicated by the potential presence of QoS filters and media 
selectors (see section 2.2.2.2) at switches in the network. QoS filters and media selectors are a 
concern in that the bandwidth requirement on the input side (i.e., upstream) of a filter/media 
selector may not be equal to the bandwidth requirement on the output side (i.e., downstream). This 
issue is resolved by treating network nodes supporting filters and media selection as virtual sources 
and/or virtual sinks. To understand the concept of virtual sources and sinks refer to Figure 5. If a 
filter is located at the rook ATM switch, a client consuming media at dwp from a source at ate 
would consider rook a virtual source. Similarly, ate would consider rook a virtual receiver. 

In order to support QoS adaptation in a multicast environment, switches must be able to merge 
RES messages from different downstream branches of the multicast distribution tree. Merging is 
rather simple in our system: a merged RES message carries a minimax pair (i.e., either {0,0}, 
{0,E1} or {E1,E2}) which corresponds to the aggregation of all possible enhancement requests. 
As a consequence of this process the QoS adaptation protocol provides support for merging and 
forwarding of the RES messages in the network. It does this by building periodic merged 
messages based on all RES messages received over the last RES interval. Conversely, if no RES 
messages have been received at a merge point during the last RES interval then the timer is simply 
reset and no further action is taken. 

Another issue addressed by the adaptation algorithm is the fair allocation of residual bandwidth 
resources among all flows competing to add new enhancement layers. Residual resources are those 
remaining once all base layer reservations are accounted for. Each switch and end-system uses a 
simple 'fair share' bandwidth allocation algorithm which allocates the residual bandwidth equally 
among competing flows. This plays an important part in providing the advertised rate, i.e., the 
portion of the residual bandwidth which can be used by a particular flow as it traverses a switch. 
The fair share algorithm only allocates resources to enhancement layers if it can meet one of the 
minimax constraints in the RES message. In the case where a flow's fair share is insufficient to 
support an enhancement layer then resources are returned to the pool and the advertised rate 
appropriately adjusted. For full details of this scheme see [Campbell ,96a]. 
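
A minimal sketch of an equal-share allocation of this kind is given below. It is not the algorithm of [Campbell,96a]: the function name `fair_share`, the representation of the minimax constraints as a list of candidate rates per flow, and the policy of dropping the most demanding unsatisfied flow first are all assumptions made for illustration.

```python
def fair_share(capacity, base_rates, demands):
    """Split residual bandwidth equally among competing flows.

    capacity   -- link capacity
    base_rates -- base layer reservations already granted
    demands    -- flow id -> list of enhancement rates that would
                  satisfy one of its minimax constraints (hypothetical)
    Returns (allocation per flow, advertised rate).
    """
    residual = capacity - sum(base_rates)
    competing = dict(demands)
    share = float(residual)
    # While some flow cannot meet even its smallest request from an equal
    # share, drop the most demanding such flow and return its share to
    # the pool, which raises the advertised rate for the others.
    while competing:
        share = residual / len(competing)
        unmet = [f for f, d in competing.items() if min(d) > share]
        if not unmet:
            break
        worst = max(unmet, key=lambda f: min(competing[f]))
        del competing[worst]
    alloc = {f: 0.0 for f in demands}
    for f, d in competing.items():
        # Grant the largest request that fits within the equal share.
        alloc[f] = max(r for r in d if r <= share)
    return alloc, share

alloc, advertised = fair_share(10, [2, 2], {"a": [1, 2], "b": [5, 6]})
```

In the usage line the residual is 6 units; flow "b" cannot be satisfied from an equal share of 3, so its share returns to the pool and flow "a" is offered the full residual.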

2.2.2.2 Media Selector

The media selector resides in network switches and works with the QoS adapter to ensure that 
the appropriate combinations of base layer and enhancement layer(s) of flows are forwarded to 
appropriate switch output ports of the ATM switch. 

As an example of the operation of the media selector, consider a multicast virtual connection
between a source located at the ate end-system and two clients at mr-little and dwp (see Figure
5). The source node, ate, transmits full resolution video (i.e. BL+E1+E2), dwp selects the E1
resolution and mr-little selects the E1 and E2 resolutions. If an E1 cell arrives at the chuff
switch and the network can meet both recipients’ QoS demands then the media selector will
forward the cell to both dwp and mr-little. On the other hand, when an E2 cell arrives on the
same connection the media selector will forward it to mr-little only.

In order to carry out such functions, the media selector must be able to delimit individual AAL5
frames in the cell stream so that BL, E1 and E2 frames can be distinguished (BL, E1 and E2
packets are all carried on the same SVC)^. To delineate AAL5 frames, the media selector exploits
the ATM user-to-user bit: a user-to-user bit of 1 represents the first cell of an AAL5 packet, and the
first 16 bits of the AAL5 payload then represent the type of the frame (BL, E1 or E2). It is
important to note that the media selector does not have to buffer AAL5 packets to determine the
type before forwarding. All that is required is to continuously monitor the user-to-user bit of cells
traversing a switch and to perform a 16-bit comparison on the first 16 bits of the payload of the
first cell of a new AAL5 frame. As a result cells are streamed through the switch with very little
additional delay over the unmodified switch code. A fundamental assumption is that the switch
architecture is capable of supporting such a scheme in software; see section 3.1 for details of the
ATM switches used in the implementation.
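
The per-cell decision described above can be sketched as follows. The 16-bit type codes in `FRAME_TYPES` are hypothetical, since the text does not give the actual values, and the class `MediaSelector` is an illustration rather than the switch code itself. The sketch follows the convention stated above that a user-to-user bit of 1 marks the first cell of a frame.

```python
import struct

# Hypothetical 16-bit frame type codes; the real values are not given.
FRAME_TYPES = {0x0000: "BL", 0x0001: "E1", 0x0002: "E2"}

class MediaSelector:
    def __init__(self, port_layers):
        # port_layers: output port -> set of frame types wanted there
        # (in the text this information comes from the QoS adapter).
        self.port_layers = port_layers
        self.current_type = None

    def forward(self, uu_bit, payload):
        """Return the output ports that should receive this cell."""
        if uu_bit == 1:
            # First cell of a new AAL5 frame: read the 16-bit type field.
            (code,) = struct.unpack("!H", payload[:2])
            self.current_type = FRAME_TYPES.get(code)
        # Later cells of the same frame reuse the remembered type, so no
        # frame is ever buffered.
        return sorted(port for port, wanted in self.port_layers.items()
                      if self.current_type in wanted)

selector = MediaSelector({"dwp": {"BL", "E1"},
                          "mr-little": {"BL", "E1", "E2"}})
first_e2_cell = struct.pack("!H", 0x0002) + bytes(46)  # 48-byte payload
ports = selector.forward(1, first_e2_cell)
```

With the port table of the example above, an E2 frame is copied to mr-little only, while an E1 frame is copied to both clients.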



^ The media selector must also know which output ports of which multicast SVCs require which AAL5 frames to
be copied to them; this information is obtained from the QoS adapter.

3. Experimental Evaluation

3.1 Experimental Environment

The METS testbed consists of four Linux-based PCs and two RAID-3 storage servers. Each PC
and storage server is equipped with ATM network interface controllers (NICs) for connection to
the ATM network. The storage servers, network interface cards and ATM switches are all
experimental devices supplied by Olivetti Research Labs (see [French,94]).

Figure 5: METS Testbed

The Olivetti NICs deliver ATM cells at 100 Mbps and interface to a standard PC ISA bus. An
AAL5 adaptation layer is implemented in software in the Linux device driver [Lunn,96]. The ATM
switches are 4x4s and share the same port design as the NICs. The switches have a measured
aggregate switching capability of 200 K cells/sec. The switch architecture comprises an ARM
processor connected via DMA channels to the ports. As switching is performed in software on the
ARM processor it has proved relatively easy to modify the switches to support the QoS adapter
protocol, multicasting and media selection. The multicasting implementation is based on the notion
of a designated ‘core switch’ at which all clients involved in a single flow ‘rendezvous’ with the
output provided by the server [Ballardie,93]. For example, an appropriate core switch for a
multicast flow sourced at scaup and sinked at njy and dwp would be rook. In the current
implementation, the core switch is not changed during the lifetime of a flow even if the newly
joined recipients cause the topology to become less than optimal.

As illustrated in Figure 5 the testbed includes the following client nodes: the ate and dwp
end-systems, which are 90 MHz Pentiums, and the mr-little and njy end-systems which are 66
MHz 486 machines. The RAID-3 storage servers (the magpie and scaup end-systems) utilise the
ARM 610 RISC processor and incorporate five SCSI interface controllers and disk drives.

3.2 The Experiments

In order to gain experience with the METS testbed (which is based on a native ATM 
communication stack) and evaluate its performance, we have designed a set of experimental test 
suites: 



• bandwidth analysis, which evaluates the ability of the flow scheduling, flow shaping and 
ATM infrastructure to respond to varying bandwidth demands; 

• loss analysis, which evaluates the role of the flow scheduler and shaper in reducing losses; 

• delay analysis, which evaluates the effect of multiple flows on delay distributions; 

• jitter filtering analysis, which evaluates the jitter filter’s delay estimation and playout 
algorithms at the receiver; and 

• adaptation analysis, which evaluates the QoS adapter mechanisms at the end-systems and 
network. 

All performance measurements are taken from video flows sourced at the ate end-system and
played out at the mr-little and dwp end-systems. The designated core ATM switch used during
the multicast sessions is chuff. All measurements are captured and logged at the ate and dwp
end-systems. The distance between the server and clients is 3 hops (i.e. flows emanating from ate are
played out at dwp traversing the chuff, sparrow and rook ATM switches, respectively). In all
cases the server and clients maintain logs of METS packet departure and arrival times,
respectively. The Network Time Protocol [Mills,92] provides global timing facilities between all
end-systems involved in the experimentation and logging process. The client log includes arrival
times, absolute delay and jitter of received packets and loss of METS packets at the transport level.

3.3 Bandwidth Analysis 

The object of this experiment is to determine the maximum possible transmission rate 
achievable from application space to the network. This is a measure of the maximum throughput of 
the METS communications stack in the end-system. The Canyon video clip is transmitted as
rapidly as possible. Additionally, the traffic shaper and the ATM device driver's receive interrupt are
disabled. Media traverses the METS transport system, AAL5 and ATM NIC without consideration
of the receiver’s ability to consume media.



Maximum bandwidth

The results are presented in the upper curve of Figure 6 which shows the throughput achieved
when the METS packet size is varied incrementally between 1 byte and 32 KBytes. The maximum
transmission rate attained is 32 K cells/sec (13.6 Mbps) compared to the theoretical NIC line rate of
over 235 K cells/sec.

A second experiment (the lower curve in Figure 6) measures the 'goodput' achieved between a
server and a client using flow scheduling and open loop flow control (through the flow shaper).
Goodput is the maximum transmission rate at which cells can be injected into the network and
consumed by the receiver with no significant loss resulting. To achieve the optimum goodput, cells
are 'paced' into the network at a rate agreed between the transmitter and the receiver. A
maximum goodput of 17.5 K cells/sec (7.4 Mbps) was measured. This is 54% of the maximum
transmission rate measured above.
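
The cell and bit rates quoted in this section are consistent with counting the full 53-byte ATM cell (5-byte header plus 48-byte payload). The short calculation below reproduces them; the helper names are illustrative only.

```python
CELL_BYTES = 53  # ATM cell: 5-byte header + 48-byte payload

def cells_to_mbps(cells_per_sec):
    """Convert a cell rate to Mbps, counting whole 53-byte cells."""
    return cells_per_sec * CELL_BYTES * 8 / 1e6

def mbps_to_cells(mbps):
    """Convert a bit rate in Mbps to cells per second."""
    return mbps * 1e6 / (CELL_BYTES * 8)

max_rate = cells_to_mbps(32_000)    # maximum transmission rate
goodput = cells_to_mbps(17_500)     # maximum goodput
line_rate = mbps_to_cells(100)      # 100 Mbps NIC line rate in cells/sec
ratio = 17_500 / 32_000             # goodput as a fraction of maximum rate
```

These give roughly 13.6 Mbps, 7.4 Mbps, a little over 235 K cells/sec and a ratio of just under 0.55, in line with the figures reported above.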

3.4 Loss Analysis 

The objective of this experiment (see Figure 7) is to determine the percentage of lost packets at 
the receiver as a function of the number of video flows received. Video was played at 12 fps. Tests 
were performed both with and without traffic shaping at the source. 




Figure 7: METS Frame Loss Distribution With and Without Traffic Shaping 

The upper curve in Figure 7 depicts the loss occurring from an unregulated source while the
lower curve depicts the loss resulting from flows shaped by the METS flow shaper. The number
of received video flows varies from 1 to 8. The maximum percentage loss measured at the receiver
is 33% for regulated traffic and 60% for unregulated traffic. In the case of
unregulated traffic, performance degrades rapidly when four flows are received simultaneously.
This represents a loss of greater than 10% - which is starting to be significant to the end-user.
Regulated traffic on the other hand only exhibits 10% loss when the number of flows approaches



3.5 Delay Analysis 

3.5.1 Single Flow Delay Distribution 

The configuration of this experiment is a lightly loaded network with one video flow running at
24 fps. The average delay (see Figure 8) measured by the transport protocol at the receiver is 4 ms.
The minimum and maximum delays recorded are 2 ms and 19 ms, respectively, with a standard
deviation of 2 ms.

End-to-end delay distribution - 1 flow




Figure 8: Per-Packet End-to-end Delay Distribution for One Flow 

3.5.2 Effect of Multiple Flows on Delay 

This second experiment in the delay suite measures the delay statistics experienced when the
number of transmitted flows is varied between one and eight. As can be seen in the lower curve of
Figure 9, there is little difference, just 6 ms, in the average delay measured as the number of flows
increases. The variation in the maximum delay experienced is, however, much larger at 42 ms.

Transport end-to-end delay distribution for flows received

Figure 9: End-to-end Delay Distribution

3.6 Jitter Filtering Analysis 

This experiment investigates the ability of the jitter filtering mechanism (see section 2.2.1.3) to
adaptively adjust the playout delay experienced by flows at the receiver to meet end-to-end delay
and jitter requirements. The source packetises a single video stream and attempts to transmit it at an
isochronous rate of 24 fps. At the same time, four other background flows are simultaneously
being handled by the same receiver.

In Figure 10, the playout curve tracks the arrival time curve to the first point of loss - the region
between 42500 and 43000 ms. The first point of inflection represents a sudden increase in the
end-to-end delay and subsequent loss of a number of METS packets. The second point of inflection
(between 43000 and 43500 ms) also represents a large increase in measured delay, subsequent loss
of packets and then the simultaneous arrival of a group of packets at the receiver. It can be seen that
significant packet loss occurs between 43000 and 43500 ms due to underestimation of the
maximum end-to-end delay.




Figure 10: Arrival and Playout Time Distribution High Load 

3.7 QoS Adaptation Analysis 

The final experiment investigates the performance of the distributed QoS adaptation
mechanism. In the experiment, the network provides “hard” guarantees to the base layer (BL) of a
multi-layer flow but gives no such guarantees to the enhancement layers (E1 and E2). The
enhancement layers must compete for residual bandwidth with all other flows as discussed in
section 2.2.2.1.




In the experiment (see Figure 11) the receiver selects three flows for playout in the first
instance. These are:

• a "Canyon.mpg" video flow (selecting the BL, E1 and E2 components) at 24 fps;

• a "Lougher.mpg" video flow (selecting the BL only) at 5 fps; and

• a "Flight.mpg" video flow (selecting the BL only) at 5 fps.

The scenario shows the consumption of the Canyon and Flight video clips beginning at time 
zero. Both base layers are supported. The QoS adapter determines that none of the Canyon flow’s 
higher resolutions can be accommodated given the available resources. Twenty seconds into the 
scenario trace the Flight video terminates thus freeing up resources. At this point the QoS adapter 
judges that the highest resolution of the Canyon video (i.e. BL+E1+E2) may be displayed. 

This situation remains until the user chooses to display the Lougher video 50 seconds into the
trace. Resources are thus allocated to meet the base layer QoS requirements of the new video. The
QoS adapter protocol sends a RES message toward the core requesting resources to meet the
highest resolution (E2) of the Canyon.mpg video. In this instance, however, there are insufficient
resources available to meet the QoS requirements of the highest resolution although resources are
adequate to support the lower resolution (E1).

3.8 Discussion of Results 

The results of the first three experiments show the raw performance limitations of the 
experimental implementation. Regarding the bandwidth analysis, the bottleneck appears to be a 
combination of the limitations imposed by the ISA system bus and the experimental prototype 
NIC. For example, the host CPU must perform the segmentation and reassembly of AAL5 SDUs,
and copying of cells to/from host memory. Other limitations of the NIC are limited buffering and
the fact that there is an interrupt at the receiver for each ATM cell delivered by the NIC^. Although
an obvious solution to these problems would be to employ a NIC that offered AAL processing and 
DMA data copying on-board, the experimental NIC is very valuable to us for experimentation as 
we have the flexibility of host CPU control of functions such as flow scheduling and shaping. 

As presented in Figure 6, receivers in our testbed can accommodate a maximum transmission
rate of up to 8 K cells/sec after which they start to drop cells. In the case of the current
implementation the loss of one cell causes an entire AAL5 packet to be lost. Thus, the loss of a few
isolated cells can have a seemingly disproportionate effect on the delivered bandwidth, especially if
packets are large. By adding ‘dummy’ replacement cells at the receiver whenever cell loss is
detected, it should be possible to improve the results shown in Figure 6.
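
The effect of losing a whole AAL5 packet for every lost cell, noted above, can be made concrete with a small calculation. It assumes that cell losses are independent with a fixed probability, which is a simplification that the measurements do not claim, and that each AAL5 frame carries an 8-byte trailer and is padded to a multiple of 48 bytes. The function names are illustrative.

```python
import math

def cells_per_packet(payload_bytes):
    """Number of 48-byte cell payloads needed for one AAL5 frame
    (payload plus 8-byte trailer, padded to a multiple of 48)."""
    return math.ceil((payload_bytes + 8) / 48)

def packet_loss_prob(cell_loss_prob, payload_bytes):
    """Probability that at least one cell of the packet is lost,
    assuming independent cell losses (an idealisation)."""
    n = cells_per_packet(payload_bytes)
    return 1.0 - (1.0 - cell_loss_prob) ** n

small = packet_loss_prob(0.001, 1024)    # 1 KByte packet
large = packet_loss_prob(0.001, 32768)   # 32 KByte packet
```

Under these assumptions a cell loss rate of 0.1% discards about 2% of 1 KByte packets but roughly half of all 32 KByte packets, which illustrates why large packets suffer disproportionately.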

The loss analysis experiments show that traffic shaping is a valuable technique for reducing cell 
loss. Receivers could consume four regulated Canyon flows at 12 fps with an overall frame loss of 
10%. This rose to 33% loss for eight flows consumed. Corresponding performance for the 
unregulated case was measured to be 15% and 60% for four and eight flows consumed, 
respectively. 

The delay analysis highlighted that, while there were minor variations in the average end-to-end
delay as more flows were consumed, there was considerable variation in the maximum end-to-end
delay. The average delay difference experienced between one flow and eight flows was found to be 6
ms. In contrast, the maximum delay difference measured was 42 ms. Since the jitter filter works
by estimating the maximum delay the latter results are significant.

^ The ISA ATM device driver, however, reduces this overhead by checking whether any cells have arrived at the
end of each receive interrupt cycle.

Turning to the jitter filter analysis itself, the jitter filter proved to be highly successful. One
remaining problem is that sudden, unexpectedly large, jitter such as that evident in Figure 10 is
difficult to deal with. A solution is to increase the value of the ‘confidence factor’ 2s (see section
2.2.1.3); overestimation of this kind, however, can lead to large buffer requirements at the receiver
and an overall increase in end-to-end delay. Note that the jitter filter’s playout algorithm adjusts to
fluctuations detected in the measured maximum delay but that these adjustments are rather
conservative in nature. Conservatism, however, appears to be a good policy in the long term since,
as is seen, the estimate is quickly back on track. Appropriate choice of filter coefficients, which
influence the responsiveness of the playout algorithm to changes in arrival patterns, is another
key issue. Choosing larger coefficients would cause the playout to mirror fluctuations in the arrival
time distribution; this, however, is not always the best policy. Optimally, the playout time should
evolve in response to trends in the arrival time patterns and not in response to occasional spikes.
During experimentation, coefficient values of 1/8 and 1/16 were determined to be the most
appropriate for our local area ATM environment. For a full discussion on filter coefficients see
[Jacobson,88] and [Ramjee,94].
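
The role of the filter coefficients can be illustrated with a generic estimator in the style of [Jacobson,88]. This is not the METS jitter filter of section 2.2.1.3: the class name, the parameter names `alpha`, `beta` and `k`, and the assignment of 1/8 to the mean and 1/16 to the deviation are assumptions made purely for illustration.

```python
class PlayoutEstimator:
    """Exponentially weighted estimate of delay and its variation."""

    def __init__(self, alpha=1/8, beta=1/16, k=2.0):
        self.alpha = alpha   # gain applied to the mean delay estimate
        self.beta = beta     # gain applied to the deviation estimate
        self.k = k           # safety multiplier (hypothetical value)
        self.mean = None
        self.dev = 0.0

    def update(self, delay_ms):
        """Feed one measured delay; return the suggested playout delay."""
        if self.mean is None:
            self.mean = float(delay_ms)
        else:
            err = delay_ms - self.mean
            # Small gains follow trends but damp isolated spikes.
            self.mean += self.alpha * err
            self.dev += self.beta * (abs(err) - self.dev)
        return self.mean + self.k * self.dev

est = PlayoutEstimator()
first = est.update(4)    # steady 4 ms delay
spike = est.update(12)   # a single 12 ms spike
```

With these gains a single 8 ms spike raises the suggested playout delay by only 2 ms; larger coefficients would track the spike more closely, which is the trade-off discussed above.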

Regarding the QoS adaptation experiments, it was demonstrated that the distributed QoS
adapter scheme successfully maximises the utilisation of the available bandwidth by dynamically
adjusting the resolutions of flows to meet the specific needs of different clients. While discrete 
adaptation was noticeable, the resulting perceptible changes were not unacceptable to casual users 
and the concept appears to be very promising. One disadvantage of the currently implemented 
scheme, however, is that discrete fluctuations may be undesirable in certain application contexts. A 
solution to this problem is the notion of continuous adaptation using the dynamic rate shaping filter 
[Eleftheriadis,95]. The integration of dynamic rate shaping into the current testbed is an issue for 
future work. 

4. Concluding Remarks 

This paper has described and evaluated an instantiation of the QoS-A transport layer in a local 
ATM environment. Although the implementation is successful, it remains to be seen how well the 
design - particularly the QoS adapter protocol and multicasting mechanisms - will translate to a 
wide area context with large numbers of receivers with heterogeneous QoS demands. The 
implementation was carried out in a conventional OS environment (Linux) in contrast to our
previous implementation work [Coulson,95] which was embedded in a more deterministic system
based on the Chorus microkernel. The fact that two separate instantiations of the QoS-A now exist
provides evidence that the QoS-A framework is valid and useful.

It has been demonstrated by performance evaluation that the METS QoS mechanisms (viz.
flow scheduling, flow shaping and jitter filtering), which were specifically designed to operate in
an adaptive environment, can successfully mask many of the finer-grained effects of fluctuating
network and end-system resource availability. In addition, we have shown how the distributed
QoS adapter protocol permits a far grosser level of adaptation while still maintaining a high degree
of transparency for applications. We believe that our results vindicate the QoS-A approach of
selective transparency of QoS mechanisms. Applications can choose QoS management strategies
from a range of possibilities on a continuum from i) delegating all responsibility for dynamic QoS
management to the underlying system, to ii) simply exploiting API upcalls to perform all dynamic
QoS management for themselves (cf. vat, vic, etc.). In all cases, applications choose their preferred
strategy by appropriately initialising a QoS policy.

Finally, as well as throwing light on performance measures, our implementation has also made
possible a qualitative assessment of the METS API. Despite its relative complexity, the API was, in
fact, found to be fairly easy to use in practice. The separation of concerns achieved through the use
of separate sockets for data, control and management proved successful and application writers
porting the NVS system from a standard Berkeley sockets environment found little difficulty. The
potential complexity of QoS specification was largely avoided through the use of sensible default
values for QoS parameters and policies. In the future we intend to further ease the complexities of
QoS specification by dynamically deriving per-user QoS preferences from live sessions and storing
them in a database for future application runs. At Columbia University we are experimenting with
an open programmable networking environment called xbind [Lazar,94] based on CORBA
technology. We are currently investigating the suitability of our transport system, adaptation
protocol and QoS filtering techniques for wired-to-wireless ATM environments [Porter,94]; that is,
we are augmenting xbind with QoS adaptive multimedia applications, QoS adaptive transports
[Huard,96] and dynamic rate shaping to provide QoS controlled mobility [Campbell,96b] in ATM
networks and their wireless extension.

Abstract

The rapid development of networked multimedia applications places new and demanding
requirements on the network as well as on the systems involved. In particular, guaranteed
Quality-of-Service (QoS) is required for certain applications with isochronous data streams or with
highly interactive characteristics. Therefore, integrated systemwide QoS management is
needed. Among others, issues such as the mapping of QoS parameters between fundamentally
different levels of the system (user, application, network) and QoS monitoring are very important.
This paper presents a framework for QoS management. Requirements of networked multimedia
applications as well as several QoS management issues are discussed. Moreover, a
QoS monitor as part of the QoS framework and its implementation are presented.



Keywords 

Quality of Service (QoS), monitoring, QoS management 



1 INTRODUCTION 

Networked multimedia applications are becoming increasingly popular. They even seem to
be changing society rapidly towards a so-called information society. However, most available
systems are little more than toys. They are by no means ready for use outside the research and
development community. A popular example can be seen in the widely used MBone tools for
videoconferencing in the Internet (Kumar, 1995). Depending on the time of day, the Internet
may provide good or poor service quality to the user. The audio stream in particular is very
sensitive to packet losses. This instability in delivering a specified service makes
those tools over the Internet unsuitable for professional environments. In addition, most
common tools and standards provide only very basic control and cooperation mechanisms to
support conferencing and collaborative work. Even the ITU-T standards
family T.120 (T120, 1994) presents a conferencing framework only. Generally, there exists an
extreme mismatch between the current capabilities of networked multimedia applications and the
advertisements made in research and non-research communities. Emerging networked multimedia
applications need much better support with respect to the QoS delivered to the user. In
particular, the need for continuous data delivery in the case of audio and video streams contributes to
these requirements. Additionally, interactivity is highly demanding. In the following, applications
are categorized according to their specific QoS requirements (cf. Table 1):

• Broadcasting applications: typical examples are Digital Video Broadcast (DVB) and Digital
Audio Broadcast (DAB).

• Interactive playback: typical examples include VoD (VCR-like) and teaching/tutoring
applications. Mostly, playback is intended for a single individual user (or for a single end
system, e.g., TV).

• Conferencing: conferencing applications include audio and video streams. Their main focus
is on these two media. Data applications, e.g., whiteboard applications, may be incorporated
to support conferencing.

• Collaboration: we classify as collaborative those applications that focus on the collaborative
data part. Audio and video may just be used as supportive tools. Typical examples of
collaborative applications are joint editing and interactive games.

• Traditional real-time applications: a typical example can be seen in tele-robotics.



Table 1 QoS-driven Application Categories

Data Flow   Type           Timing Requirements       Typical User Requirements
----------  -------------  ------------------------  -------------------------------------------------
asymmetric  broadcasting   isochronous data          low set up latency (few seconds)
                                                     high downstream bandwidth (several Mbps)
                                                     TV or HIFI quality (color, resolution, ...)

asymmetric  playback       isochronous data          high downstream bandwidth (video clips)
                           (short or long streams)   low upstream bandwidth (interactions)
                           interactivity             low upstream and downstream latency
                                                     TV or HIFI quality (color, resolution, ...)

            real-time      interactivity             typically low bandwidth and latency
                                                     guaranteed maximum reaction time

symmetric   conferencing   isochronous data          high bandwidth requirements
                           interactivity             low latency
                                                     high video/audio quality (video size, resolution)
                                                     synchronisation (mostly audio and video)

            collaboration  interactivity             moderate bandwidth requirements
                           isochronous data          very low latency (e.g., joint editing)
                                                     synchronisation (very fine granularity)



In Table 1, two basic types of timing requirements are distinguished: 

• requirements from isochronous data streams, and 

• requirements due to interactivity. 



Isochronous requirements are introduced by applications that include audio and/or video
streams which need continuous data delivery to the user. This requires guaranteed upper
bounds on delay jitter. The requested QoS needs to be supported without any disruption during
the entire communication phase. Additionally, reliability requirements of isochronous
applications differ from those of traditional data applications:

• reliability is tied to timeliness (i.e., data received after a deadline is considered as lost),

• no absolute reliability is required (tolerance depends on coding and compression).
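
The first point can be illustrated with a short sketch that counts late packets as lost. The function name and the representation of arrivals and deadlines as millisecond lists are assumptions made for the example.

```python
def effective_loss(arrivals_ms, deadlines_ms):
    """Fraction of packets that are missing (None) or arrive after
    their playout deadline, and are therefore useless to the receiver."""
    late_or_missing = sum(
        1 for arrival, deadline in zip(arrivals_ms, deadlines_ms)
        if arrival is None or arrival > deadline
    )
    return late_or_missing / len(deadlines_ms)

# Four packets: one never arrives and one arrives 10 ms too late.
loss = effective_loss([10, None, 55, 70], [40, 40, 60, 60])
```

Although only one packet is dropped by the network in this example, the effective loss seen by the application is one half, because the late packet counts as lost as well.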

Interactivity basically requires very low response times to user actions and, thus, very low 
latency. High bandwidth is not necessarily needed. 

Traditional applications, such as file transfer, do not require any specific guaranteed QoS. In
the Internet Services framework (Braden, 1994) such applications are named elastic applications.
Elasticity applies to their very loose timing requirements. Timing requirements may,
however, be stricter if they are part of an application which consists of multiple streams
(audio, video, data) that need to be synchronized. This may implicitly lead to high service
requirements on such traditional applications as well. Generally, completely independent streams
may have different requirements than streams that are part of a more complex application.

Moreover, many forthcoming applications are inherently based on multicasting. Generally, 
group communication should be seen as a major communication paradigm for the future and, 
thus, should be directly reflected in any attempt to support QoS. It leads to heterogeneous 
QoS support within an interacting group due to different end system and network attachment 
characteristics (e.g., high-end workstations versus PDAs). 

All components involved in the communication need to address these QoS requirements,
including operating systems and system architectures (e.g., bus access latency and jitter).
However, up to now, many investigations on QoS have focused on isolated issues, such as networking
or individual protocols. Generally, a lack of approaches that integrate protocol and application
layer processing as well as operating systems and system architecture issues can be observed.
In most cases, user-to-user services - as they are finally required - are not addressed. For that
goal, an integrated systemwide QoS management approach is critical. In this context, network
issues as well as protocol, application, and operating system issues need to be addressed in an
integrated way. QoS management comprises different tasks, including, among others, local
and network wide resource management as well as QoS monitoring. This paper discusses
various issues concerning integrated QoS management. In particular, an approach towards QoS
monitoring is presented in some detail.

The paper is structured as follows. Section 2 presents a clearly defined QoS framework.
Section 3 concentrates on QoS management itself and on its proper structuring. The QoS
monitor and its implementation are presented in section 4. Section 5 concludes the paper and
gives an outlook on future work.



2 QOS FRAMEWORK 

Since Quality-of-Service (QoS) forms an important part of today's research work in the
communications area, one would assume that a crisp definition of QoS can be given. This is not the
case, since the term QoS is defined with different degrees of abstraction. In the following,
some typical examples are summarized.

2.1 QoS Definitions 

Described very roughly, Quality-of-Service (QoS) refers to the support of certain service
parameters in order to eventually accommodate user requirements. There exists no single definition
of QoS. Commonly various descriptions reflecting different levels of granularity are used (cf.
Table 1). The spectrum includes highly generic definitions as well as detailed ones addressing
specific parameters and their requirements. This variance also reflects different levels of a
communication system at which QoS is applied (cf. section 2.2).

Table 2 presents some collected QoS definitions. It is by no means complete; however, it
already reflects many flavours of QoS definitions. Basically, every group defines its own QoS
specification dependent on the resources and requirements of the component under consideration.
For example, the definition used within the ODP framework appears to be oriented towards
the higher levels. It defines QoS as a set of qualitative requirements on the collective
behaviour of one or more objects (Vogel, 1995). This raises the important issue of collaboration
among different objects. Objects are, however, not defined any further. The ETSI definition
focusses solely on the QoS specified at the user interface. In contrast, the ITU concentrates
mostly on lower levels, e.g., on the network level. However, the ITU recommendation
E.430 clearly states that user oriented QoS parameters most probably differ from network
QoS parameters. The latter are called performance parameters. Generally, higher levels tend
more to deal with qualitative issues and parameters; lower levels mainly address quantitative
tasks or performance parameters. Thus, different kinds of parameters are applied dependent on
different levels in the communication system. This clearly raises the mapping issue between
parameters used at different levels. Since QoS definitions that deal with higher levels, such as
the user and the application level, are somewhat generic, mapping them to system and network
parameters is not always straightforward. Another factor that needs to be addressed here is
the set of interactions among system components that may affect mapping rules.



Table 2 Various QoS Definitions

Origin        QoS Definition
------------  --------------------------------------------------------------------------------
ISO           A concept for specifying how good network services are

OSI ODP       A set of quality requirements on the collective behaviour of one or more objects

ETSI          QoS is described in terms of a set of user-perceived characteristics of the
              performance of a service. It is expressed in user-understandable language and
              manifests itself as a number of parameters, all of which have either subjective
              or objective values

ITU-T, I.371  QoS requirements are specified in terms of objective values of some of the
              network performance parameters specified in Recommendation I.356

ITU-T, E.430  Distinction between user QoS and Network QoS



Independent of the individual specification, QoS eventually deals with certain requirements on
the system that are expressed using various so-called QoS parameters. These parameters differ
based on the level being addressed. QoS parameters are somewhat comparable to regular
system parameters, such as main memory size and CPU power. In contrast to system parameters,
however, users cannot simply be asked to increase QoS parameters, for example, to buy more
network bandwidth. In the case of traditional system parameters, such as main memory
size, users usually have two choices: buy more main memory or survive with slow applications,
i.e., with a low QoS. In the worst case, the application will not run at all. Such behaviour is
more or less tolerated by users for single machines, but not for networked environments.

System parameters are local to a system, whereas QoS parameters depend on various local
and non-local resources. In particular, a good overall utilization of non-local resources appears
to be important. Handling of QoS parameters is somewhat more complex than handling regular
system parameters, because:

• requirements address both individual and shared resources;

• requirements may not be known in advance by the user;

• requirements may vary highly over time;

• parameters are negotiated;

• parameters may be mapped or translated between different QoS levels;

• multiple systems/components are involved, leading to interrelations among parameters.

Therefore, specific QoS management is required. This paper addresses various aspects of QoS 
parameters, including mapping and negotiation. These tasks are incorporated in a framework 
that addresses user-to-user QoS issues and that is currently being implemented. Due to the 
limited space, only selected aspects can be presented. 

2.2 QoS Levels 

Since Quality of Service is related to the dedication of shared resources to networked applica- 
tions, it needs to be considered at different levels (i.e., for various types of resources) within a 
communication system. Therefore, we distinguish three QoS levels (cf., Figure 1):

• user level, 

• application level, and 

• network level. 



[Figure: user level, application level and network level of a communication system, annotated
with "mostly qualitative", "mostly quantitative" and "external communication"]

Figure 1 QoS levels within a communication system



The user level is directly related to human perception issues. Typically, users state their QoS
requirements in an imprecise and highly subjective manner. Examples are the type and size of a
video window (e.g., colour and large) in the case of a videoconferencing application. The user
should not need deep technical knowledge in order to operate networked multimedia
applications. Thus, the requirements will usually not be specified very precisely. For example,
the value large for the size of a video window is not precise and, moreover, is relative to
the size of the screen used. Additionally, human perception leads to implicit specifications of
QoS requirements, especially in the case of synchronisation. For example, in a videoconference
that includes a shared whiteboard, synchronization requirements are implicitly included.
The user level comprises resources that are sources or sinks of data (camera, loudspeaker,
screen, and the like). User level requirements need to be translated into application level
requirements. Therefore, the communication system holds mapping tables which are used to
transform the fuzzy, highly subjective and qualitative requirements into the more precise and
quantitative requirements needed at the application level interface.
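The idea of such a mapping table can be illustrated with a small sketch. It assumes, purely for illustration, that a qualitative window size is mapped to a percentage of the screen width and that a 4:3 aspect ratio applies; the names `SizeMapping`, `WindowSize` and `map_window_size` and the percentages are hypothetical and not taken from the framework.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical application-level result of mapping a fuzzy user-level value. */
typedef struct {
    int width;   /* pixels */
    int height;  /* pixels */
} WindowSize;

/* Hypothetical mapping table entry: qualitative label -> share of screen width. */
typedef struct {
    const char *label;
    int percent_of_screen;   /* assumed values, for illustration only */
} SizeMapping;

static const SizeMapping size_table[] = {
    { "small",  25 },
    { "medium", 50 },
    { "large",  75 },
};

/* Translate a qualitative size into pixels, relative to the screen in use.
 * Returns 0 on success and -1 if the label is not in the table.
 * A 4:3 aspect ratio is assumed for the height. */
int map_window_size(const char *label, int screen_width, WindowSize *out)
{
    for (size_t i = 0; i < sizeof size_table / sizeof size_table[0]; i++) {
        if (strcmp(label, size_table[i].label) == 0) {
            out->width  = screen_width * size_table[i].percent_of_screen / 100;
            out->height = out->width * 3 / 4;
            return 0;
        }
    }
    return -1;
}
```

Because the result depends on `screen_width`, the same label yields different quantitative values on different displays, which reflects the remark that large is relative to the screen used.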

The user level passes mostly quantitative parameters to the application level. Typical ex- 
amples are the size of the application data unit, the maximum data rate expected, and the 
maximum accepted delay. The application level deals with local parameters only. Resources at 
the application level are application-related protocols and protocol mechanisms, the system 
architecture as well as the operating system. 

The operating system interacts with the network level due to its scheduling and memory 
management tasks. Interrupt handling is also an important issue that influences achievable QoS 
guarantees. The network level deals with parameters that are concerned with the quality of 
service of external communication. The network level interfaces to all functionalities that are 
mainly dependent on the network behaviour and that regulate the end-to-end flow. Therefore, 
this level would most commonly be mapped to the traditional transport service interface. Thus, 
it can, for example, be compared with the socket interface. However, the model itself does not 
include any implementation details. For example, it does not consider whether the transport 
layer protocol is implemented inside the kernel of the operating system (as in most traditional 
UNIX systems) or in user space, closely attached to the application itself (as can be seen
in some newer implementations).

Between the QoS levels, the type of parameters changes. Within a level, parameters may
need to be translated, but the type of parameters remains the same. At the user level they are
more or less solely qualitative parameters. At the application level, mostly quantitative
parameters appear, mixed with a few qualitative parameters. The network level deals with
quantitative parameters only. These parameters are related to external communication. The
network level QoS can be compared to the end-to-end QoS often referred to in research papers.

Using three QoS levels seems to be a cleaner design than distinguishing many levels that 
only have marginal differences. For example, in (Nahrstedt, 1995) five levels are distinguished 
including system and device level. However, the distinction between user and device level is 
not needed if the model is viewed by a proper abstraction. Both levels represent sources and 
sinks of data. Similar arguments hold for the distinction of application and system levels. Ad- 
ditionally, the same type of resource applies to different levels in the model, which does not 
present a clean design. Another approach of decomposing the communication systems into 
different QoS levels is presented in (Campbell, 1996). The authors follow more or less directly 
the OSI reference model; they even add additional layers. A clear separation of tasks and re- 
sources is missing. 

3. QOS MANAGEMENT 

In order to provide the QoS requested by a networked application, resources are required in
the network, the intermediate systems and the end systems. These resources are typically
shared among various users (or system components) and, thus, conflicts may happen. However,
some applications need to deliver data continuously (e.g., audio and video streams) to
the user. They require service guarantees. Therefore, network and system resources need to be
reserved. Traffic characteristics of networked applications may vary over time and are mostly
not known in advance. Thus, resources have to be allocated without complete knowledge of
traffic characteristics. Since maximum reservations lead to low resource utilization, [...]
reservation strategies should be chosen, e.g., on some average and maximum values [...]
this may result in temporary overload situations and, thus, in a drop of service quality
experienced by some users. Resource utilization and application behaviour need to be [...]
continuously in order to avoid any service problems. More generally, it can be [...]
communication systems need some QoS management in order to properly serve [...]
multimedia applications. A systemwide QoS management supporting guaranteed [...]
service quality is required. QoS monitoring forms an important part of QoS management.

3.1 Dual Stack Architecture 

Within a communication system it should be clearly distinguished between regular [...]cation
operations and QoS-related tasks. Figure 2 presents a structure that is based on two
different stacks:

• application stack and 

• QoS management stack. 



[Figure: application stack alongside the QoS management stack]

Figure 2 Dual-stack QoS architecture

The QoS management stack is responsible for handling tasks related to QoS mapping, negotiation,
setup and monitoring. It is subdivided into several components, among them:

• a local resource manager and 

• a network resource manager. 

The local resource manager has to form an application profile that contains application-specific
resource requirements, such as audio/video encoding scheme or encryption requirements.
An example of an application profile is given in Table 3 considering different levels of the
communication system (cf., Figure 1). Internally, the values are derived using [...]
mapping tables that contain profiles. At the user level, the local resource manager [...]
reservations or availability checks of presentation-related local resources. This can be performed
without involving the operating system. At the user level, parameters that are closely related to
the presentation quality itself are considered. Examples of such parameters are colour (yes/no),
video (yes/no), MPEG-2 (yes/no). The reply to these parameters can be seen as a selective
yes-no type of answer. It can be returned to the user immediately after consulting the
configuration of the local system. In addition, further parameters may be used in order to
describe the presentation somewhat more precisely, for example the size of a video window
(small, medium, large). These parameters are subjective and qualitative in nature and
usually do not give any precise information. The profile, combined with collected experience
about the user, is applied to resolve such fuzzy QoS requirement specifications.

Furthermore, the local resource manager takes care of local resources, such as processor,
memory space, encoding schemes, application types, and local bandwidth. It informs the
application immediately if sufficient local resources with respect to the application profile
cannot be made available.



Table 3 Example of application profile (see (Nahrstedt, 1995))

                   Audio                                   Video
-----------------  --------------------------------------  --------------------------------
Application level  sample size: 8 bit (telephone quality)  frame rate: 30 fps
                   sample rate: 8 kHz                      frame width: <= 720 pixels
                   playback point: 100-150 ms              frame height: <= 576 pixels
                                                           color resolution: 16 bit/pixel
                                                           aspect ratio: 4:3
                                                           compression ratio: 2:1

Network level      end-to-end delay: <= 150 ms (most),     end-to-end delay: 250 ms
                     400 ms (some)                         packet loss (unc.): <= 10^[...]
                   round-trip delay: <= 800 ms             packet loss (co.): <= 10^[...]
                   packet loss: <= 10^[...]                bit error rate: <= 10^[...]
                   bandwidth: 32 kbps (audioconf)          bandwidth: <= 1.86 Mbps (MPEG)



The application specific profile is forwarded from the initiator of the communication associa- 
tion to a so-called group mediator. The group mediator is especially helpful in case of hetero- 
geneous QoS requirements within a group communication. In case of a simple point-to-point 
communication, it may not be required. However, point-to-point communication should gen- 
erally be treated as a special case of multipoint-to-multipoint communication because many 
emerging applications are inherently based on multipoint communication scenarios. The group 
mediator may adapt the application profile for group members with different QoS capabilities 
(e.g., considering lower capabilities of PDA systems compared to high-end workstations). The 
resulting application profile is then forwarded to the communication partner(s) during the set- 
up and negotiation phase. It is accompanied by network resource requirements. Forwarding 
network requirements is not sufficient since they do not provide enough information about the 
original user requirements. The profile presents this information to the remote user(s) in order 
to allow a complete user-to-user QoS coverage. Completely bidirectional translation/mapping 
of QoS parameters could avoid sending the profile. However, it seems not to be a practical 
approach because users usually make use of mostly qualitative and often fuzzy requirement 
specifications. This leads to difficulties with bidirectional mapping especially if some knowl- 
edge about user behaviour is included in the decision. Thus, the profile including the mapping 
at the original side is passed to participating communication partners. As a result, a set-up
phase is required for every communication association that needs guaranteed QoS support.
This can be combined with the connection setup phase. Thus, only a marginal additional delay
should be expected.
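How a group mediator might adapt a profile for a less capable group member can be sketched as follows. This is only an illustration under the assumption that adaptation means limiting each requested value to the member's capability; the type `VideoProfile`, its fields and the function `adapt_for_member` are hypothetical names.

```c
/* Hypothetical video part of an application profile (field names are assumptions). */
typedef struct {
    int frame_rate_fps;
    int frame_width;    /* pixels */
    int frame_height;   /* pixels */
} VideoProfile;

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Adapt the requested profile to one group member:
 * every field is limited to what that member can handle. */
VideoProfile adapt_for_member(VideoProfile requested, VideoProfile capability)
{
    VideoProfile adapted;
    adapted.frame_rate_fps = min_int(requested.frame_rate_fps, capability.frame_rate_fps);
    adapted.frame_width    = min_int(requested.frame_width,    capability.frame_width);
    adapted.frame_height   = min_int(requested.frame_height,   capability.frame_height);
    return adapted;
}
```

With a request of 30 fps at 720 x 576 pixels, a member limited to 10 fps at 320 x 240 receives the reduced profile, while a member with higher capabilities receives the request unchanged.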

In contrast to the local resource manager, the network resource manager is responsible for 
network-related resources, i.e., for end-to-end QoS. This includes resource reservations on the 
communication path, if needed. 

The network resource manager may initiate specific protocols for signaling and resource 
reservation. These protocols are typically located at the network level of the application stack. 
Currently, several protocols supporting the signaling task associated with the set-up and
negotiation phase are under discussion, for example Q.239B as a signaling protocol within the
ITU-T standards (Stiller, 1995) or RSVP as a reservation protocol within the Internet protocol
suite (Thomas, 1996).

The QoS monitor then needs to monitor resource utilization with respect to QoS. It is part of
the QoS management stack (cf., Figure 2). The design and implementation of a prototypical
QoS monitor is presented in section 4.

3.2 Related Work 

An overview of QoS architectures focussing on architectural perspectives can be found in
(Campbell, 1996). It presents a high-level overview and compares several architectures. However,
no discussion of implementations and real measurements is included. Additionally, an
integrated system-wide approach to QoS management still seems to be missing.

In (Nahrstedt, Smith, 1995) an architecture basically reflecting the QoS components shown
in Figure 2 is presented. The main component is called QoS-Broker. The QoS-Broker can be
directly compared with the QoS-Manager presented in (Schmidt, 1995). Typically, it would be
part of the QoS management stack introduced above. The dual-stack architecture is derived
from (Schmidt, 1995). Another QoS Manager is presented in (Quang, 1995). The authors address
memory access time and its influence on the observable QoS in a very general way.

The overview paper (Nahrstedt, 1995) focusses on resource management. It gives a summary
of some QoS parameters as they are needed in several layers of a communication system.
Interactions among different components are discussed at a very high level. No further details
are given. In order to evaluate their QoS-Broker approach, the authors use tele-robotics as a
test application. They do not consider networked multimedia applications as presented in
section 2. However, such applications are rapidly emerging.

More detailed experiments on QoS management are presented in (Vogt, 1995). DVI-encoded
video streams are discussed in order to validate an analytical model. The goal is to
reduce the amount of resources that need to be reserved and to investigate the trade-off
between reservations and the maximum delay experienced.

In (Ramanathan, 1995) a QoS management approach for video services is presented. The
approach needs to be seen in the context of ALF (Application Level Framing) (Clark, 1990).
The basic idea is that if an application frame (here a video frame) is already invalid due to
missing or erroneous packets of lower layers, all remaining packets for that application frame
should be discarded right away. A bit indicating the duration of a frame is used in order to
provide switches with the capability to discard packets that are no longer needed.

3.3 QoS Parameters & Mapping

QoS is associated with so-called QoS parameters that are used to specify the expected quality.
Different parameters are applied for the resources at the three levels introduced above.
Generally, quantitative and qualitative parameters can be distinguished, as presented in
(Zitterbart, 1993). Qualitative parameters basically select whether a certain service property is
available or not, for example, whether encryption is required. Moreover, qualitative parameters
include subjective requirements stated at the user interface. Quantitative parameters are those
that can be evaluated in terms of certain measures, such as bits per second or number of errors.



Table 4 Typical QoS parameters

Level            Quantitative QoS Parameters                Qualitative QoS Parameters
---------------  -----------------------------------------  ------------------------------------
User             setup latency, skew                        color, presentation quality (colour,
                 (both are defined implicitly)              size, synchronicity, fidelity),
                                                            number of streams, guarantee
                                                            commitment, cost, security, group
                                                            semantics

Application
  Protocol,      size (frame, sample), delay (end-to-end,   error tolerance (% corrupted,
  etc.           round trip), delay jitter, loss rate       replicated, lost), inter-/intrastream
                 (frame, sample), resolution, compression,  synchronization, security algorithm
                 aspect ratio

  Operating      scheduling accuracy, granularity,          --
  system         period time, deadline

Network
  Protocol,      loss/error rate (cell, frame, packet),     --
  etc.           bandwidth, delay, delay jitter, bit error
                 rate, reliability (losses, duplications)

Table 4 lists typical QoS parameters of the network, application and user levels. For each of
the quantitative parameters, mean and peak values may be given. Furthermore, an interval of
acceptable values may be used for certain parameters (Zitterbart, 1993). An interesting aspect
can be seen at the user level. Many qualitative parameters are listed. In addition, two
quantitative parameters are given: setup latency and skew. These quantitative parameters are
not explicitly specified by the user. However, they are defined implicitly by the application
type and the number of data streams involved (e.g., audio, video, whiteboard, ...), because of
the inter-relations among different streams. Typical examples of such inter-relations are
lip synchronization and pointer synchronization.

Mapping of QoS parameters between different QoS levels in the system is required in order
to eventually provide user-to-user service guarantees. In the following, some examples for the
mapping of QoS parameters are given. They mainly concentrate on rate and delay requirements
associated with the parameter mapping. The following parameters are defined:

• $S(u)$: size of user data unit (e.g., video frame)

• $S(a)$: size of application data unit

• $S(n)$: size of network data unit

• $S_d(n)$: size of data portion of network data unit

The parameter $N$ reflects the number of data units sent at the user level. The rate requirements
$R$ can be transformed between neighbouring levels. They depend on the data unit sizes. Thus,
all operations that manipulate the size of a data unit change the rate requirements. Examples
are segmentation and reassembly or encoding. The parameter $\alpha$ describes the increase or
decrease of $S(a)$ compared to $S(u)$. It is calculated based on information given by the
protocol mechanisms selected at the application level. Functions such as encoding are typically
located at the application level. Segmentation and reassembly are usually not part of the
application level.

• $R(u) = (N \cdot S(u)) / \mathrm{sec}$

• $R(a) = (N \cdot S(a)) / \mathrm{sec}$, with $S(a) = \alpha \cdot S(u)$

At the network level, the parameter $\beta$ reflects the changes with respect to the size of the
data unit. At this level, segmentation and reassembly in particular need to be considered for the
calculation of $\beta$. In addition, the number of data units may change due to segmentation and
reassembly; this is reflected by the factor $\nu$.

• $R(n) = (\nu \cdot N \cdot S(n)) / \mathrm{sec}$, with $S(n) = \beta \cdot S(a)$ and $\nu = \lceil S(a) / S_d(n) \rceil$

Usually, rate requirements increase towards the lower levels: $R(n) \geq R(a) \geq R(u)$
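As a worked illustration of the rate mapping, the following sketch computes the rate requirements at the three levels for a single stream. It assumes sizes in bits, takes the number of data units per second as input, and reads the bracket in the definition of the segmentation factor as rounding up to the next integer; the names `LevelRates`, `map_rates` and `ceil_nonneg` are hypothetical.

```c
/* Rate requirements in bits per second at the three QoS levels. */
typedef struct {
    double r_user;   /* R(u) */
    double r_app;    /* R(a) */
    double r_net;    /* R(n) */
} LevelRates;

/* Smallest integer >= x for x >= 0 (written out to avoid the math library). */
static long ceil_nonneg(double x)
{
    long n = (long)x;
    return ((double)n < x) ? n + 1 : n;
}

/* n_units: data units per second at the user level
 * s_user:  S(u), size of a user data unit in bits
 * alpha:   size change at the application level, S(a) = alpha * S(u)
 * beta:    size change at the network level,     S(n) = beta * S(a)
 * sd_net:  data portion of a network data unit in bits */
LevelRates map_rates(double n_units, double s_user,
                     double alpha, double beta, double sd_net)
{
    LevelRates r;
    double s_app = alpha * s_user;
    double s_net = beta * s_app;
    long nu = ceil_nonneg(s_app / sd_net);   /* network units per application unit */

    r.r_user = n_units * s_user;
    r.r_app  = n_units * s_app;
    r.r_net  = (double)nu * n_units * s_net;
    return r;
}
```

For 10 units per second of 8000 bits each, with a size factor of 1.25 at the application level, 0.5 at the network level and a data portion of 4000 bits, the sketch yields 80000, 100000 and 150000 bit/s at the user, application and network level, so the rate grows towards the lower levels.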

In addition, other parameters need to be considered, for example delay. This involves sys- 
tem internal behaviour of the operating system and its scheduling mechanism at both, the ap- 
plication and the network level. 



4 QOS MONITORING 

In order to provide guaranteed services to the user, some control and monitoring is required in 
addition to resource reservations. Therefore, we designed a QoS monitor. In our prototypical 
implementation, the QoS monitor was integrated at the network level (Kanschik, 1996). In 
particular, it was tested with two transport level protocols, namely Sandia-XTP (XTP, 1996) 
and Patroclos (Braun, 1995). These protocols were chosen, since they provide some advanced 
QoS at their service interface. 




Figure 3 QoS monitor implementation structure 



The QoS monitor is implemented on Sun systems running SunOS 4.1.3 as part of the QoS
management stack shown in Figure 2. It is located between
the application stack and the resource managers of the QoS management stack. Moreover, an
interface to network management, namely SNMP, is provided. Thus, some longer-term QoS
information can be made accessible to other systems participating in SNMP.

4.1 Interfaces 

QoS monitoring is needed at different levels within end systems and intermediate systems.
Thus, the design of the QoS monitor should be modular enough to allow its integration at
various levels and its co-operation with multiple protocols of different levels within the
application stack. In order to achieve a very open design in that sense, an SNMP interface or a
socket interface could have been used. However, this would lead to a high system overhead.
Ideally, the QoS monitor should add only a negligible processing overhead to the communicating
systems. With that goal in mind, a direct integration of the QoS monitor into the protocol
implementation of a dedicated level would be favoured. Since this does not result in a modular
and flexible design, we did not choose this alternative. In order to be flexible and modular, we
implemented a FIFO queue in shared memory between the protocol and the QoS monitor
(cf., Figure 3). A simple data format is defined for entries in the FIFO queue. Each entry
consists of four fields: a connection key, an action code, an identification of the information
type and an information field. The format used is defined as follows:

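typedef struct {
    address64 connectionid;   /* connection key */
    char      action;         /* action code */
    char      info_type;      /* identification of the information type */
    INFO      information;    /* information field */
} IF_TYPE;                    /* interface type */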

The protocol adds an entry every time an associated event occurs (put_element
(IF_TYPE *value)). In addition to the FIFO queue, a request line is implemented. It is used
by the protocol to inform the QoS monitor that a new entry was added to the FIFO queue. It
is only used if the FIFO queue was empty before the new entry was added. Thus, the overhead
introduced by the request line is kept to a minimum. The QoS monitor removes the entries
from the FIFO queue (get_element (IF_TYPE *value)).
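The interplay of FIFO queue and request line can be sketched as a ring buffer with a flag. This is a simplified single-process illustration, not the shared-memory implementation itself: the definitions of `address64` and `INFO`, the capacity `FIFO_SIZE`, the return codes and the variable `request_line` are assumptions, and synchronization between protocol and monitor is left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t address64;               /* assumption: 64-bit connection key */
typedef union { long value; } INFO;       /* assumption: placeholder information field */

typedef struct {
    address64 connectionid;
    char      action;
    char      info_type;
    INFO      information;
} IF_TYPE;

#define FIFO_SIZE 64                      /* assumed capacity */

static IF_TYPE fifo[FIFO_SIZE];
static size_t  head, tail, count;
static int     request_line;              /* set when the monitor has to be notified */

/* Protocol side: append an entry; returns 0 on success, -1 if the queue is full. */
int put_element(const IF_TYPE *value)
{
    if (count == FIFO_SIZE)
        return -1;
    if (count == 0)
        request_line = 1;                 /* notify only if the queue was empty before */
    fifo[tail] = *value;
    tail = (tail + 1) % FIFO_SIZE;
    count++;
    return 0;
}

/* Monitor side: remove the oldest entry; returns 0 on success, -1 if the queue is empty. */
int get_element(IF_TYPE *value)
{
    if (count == 0)
        return -1;
    *value = fifo[head];
    head = (head + 1) % FIFO_SIZE;
    count--;
    return 0;
}
```

Raising the request line only on the transition from empty to non-empty means that a burst of events causes a single notification, after which the monitor can drain the queue with repeated calls to `get_element`.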





Figure 4 QoS monitor attached to XTP 

In order to interface with SNMP network management, the QoS monitor implements an
SNMP agent. It also owns a so-called QoS-MIB. The QoS-MIB holds current information on
QoS parameters, such as throughput, delay and delay jitter, with respect to the established
communication associations. Generally, it is preferred to have uni-directional data streams
reflected in the QoS-MIB. Thus, inbound and outbound traffic characteristics are separated.

In addition to its interface to the protocol, the QoS monitor also provides a graphical user
interface (cf., Figures 5 and 6). This allows the visualization of monitored parameters, such as
throughput and delay. However, the graphical interface adds substantial overhead. With the
graphical interface enabled, the performance of the Sandia-XTP implementation was limited to
a few kbps, as reflected in Figure 5. The graphical interface is written in Tcl/Tk, which is an
interpreted language and, thus, contributes to a high system load. Typically, it would be
sufficient to run the QoS monitor without this interface or with only a few parameters depicted
on the screen. For example, a visualization of selected parameters in analogy to a traffic light
(green - yellow - red) is supported in the current implementation of the QoS monitor. The green
range shows variations of less than 5% of the negotiated value. The yellow zone signals a
variation between 5% and 15%, and the red range signals a variation of more than 15%.
Moreover, visualization of most information may only be needed in case of problems (e.g.,
system overload and red signal range), but not during regular operation. It is more important
that the QoS monitor can raise a signal to the application that problems are currently being
experienced.
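The traffic-light classification can be expressed in a few lines. The sketch below assumes that the variation is the absolute deviation of the measured value from the negotiated value, relative to the negotiated value; the names `QosSignal` and `classify_deviation` are hypothetical.

```c
typedef enum { QOS_GREEN, QOS_YELLOW, QOS_RED } QosSignal;

/* Classify a measured value against the negotiated value.
 * Assumes negotiated > 0 and treats deviations in both directions alike. */
QosSignal classify_deviation(double negotiated, double measured)
{
    double diff = measured - negotiated;
    if (diff < 0.0)
        diff = -diff;
    double variation = diff / negotiated;

    if (variation < 0.05)
        return QOS_GREEN;     /* less than 5% */
    if (variation <= 0.15)
        return QOS_YELLOW;    /* between 5% and 15% */
    return QOS_RED;           /* more than 15% */
}
```

Only a change of the returned signal would need to be passed on to the application, which keeps the cost low during regular operation.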

4.2 Monitored QoS Parameters



The QoS monitor is dedicated to short-term measurements of QoS parameters. The reason
for this is the need to reflect temporary overload situations. The resource managers
are informed accordingly and need to decide about actions to be taken.

For measurements of the actual throughput T, the following formula is applied. The parameter
$\alpha$ reflects the sampling duration measured in milliseconds.

[Formula: T as a sum over the packets i = 1, ... of terms in p_i and x_i]

p_i: size of packet i in bits
x_i: time of packet i

This formula basically allows monitoring of the throughput behaviour in case of high load, i.e.,
low packet inter-arrival time. It was chosen because we mainly aim at detecting overload
situations. The graphical interface related to throughput monitoring is depicted in Figure 5.
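One plausible way to obtain a short-term throughput figure is to add up the bits of all packets seen within a sampling window and to divide by the window length. The sketch below does exactly that; it is an assumption made for illustration and may differ from the formula used in the monitor. The function name `window_throughput_bps` is hypothetical.

```c
#include <stddef.h>

/* Throughput in bits per second over one sampling window.
 * packet_bits: sizes of the packets observed in the window, in bits
 * n:           number of packets
 * window_ms:   sampling duration in milliseconds */
double window_throughput_bps(const unsigned long *packet_bits, size_t n,
                             double window_ms)
{
    unsigned long total_bits = 0;

    if (window_ms <= 0.0)
        return 0.0;                       /* no meaningful window */
    for (size_t i = 0; i < n; i++)
        total_bits += packet_bits[i];
    return (double)total_bits * 1000.0 / window_ms;
}
```

A short window makes such an estimate react quickly to temporary overload, at the price of a noisier value.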




Figure 5 Monitoring the actual throughput 



Several other parameters are monitored. The delay is also of major interest. However, it can
only be estimated. With the XTP implementation we make use of the SREQ bit. If this bit is set,
the receiver has to respond with a control packet. A timer is started when the packet is
sent and stopped after the reception of the associated control packet. The measured time
corresponds to the round-trip time. We estimate the end-to-end delay as half of the round-trip
time. Delay jitter can also be estimated if it is part of the traffic contract.

Figure 6 Monitoring Interface
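The delay estimation from a request/response pair can be sketched as follows. Timestamps are passed in by the caller in milliseconds, and halving the round-trip time implicitly assumes a roughly symmetric path; the type `RttProbe` and the three functions are hypothetical names.

```c
/* State of one outstanding round-trip measurement. */
typedef struct {
    double sent_ms;   /* time the packet with the request bit was sent */
    int    pending;   /* 1 while the control packet is outstanding */
    double rtt_ms;    /* last measured round-trip time */
} RttProbe;

/* Start the timer when the packet carrying the request bit is sent. */
void probe_sent(RttProbe *p, double now_ms)
{
    p->sent_ms = now_ms;
    p->pending = 1;
}

/* Stop the timer when the associated control packet arrives.
 * Returns 0 on success, -1 if no measurement was outstanding. */
int probe_reply(RttProbe *p, double now_ms)
{
    if (!p->pending)
        return -1;
    p->rtt_ms  = now_ms - p->sent_ms;
    p->pending = 0;
    return 0;
}

/* End-to-end delay estimated as half of the round-trip time. */
double estimated_delay_ms(const RttProbe *p)
{
    return p->rtt_ms / 2.0;
}
```

Differences between successive estimates could serve as a simple indication of delay jitter.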



In addition, certain measures related to reliability are supported by the QoS monitor (cf.,
Figure 6). For example, bit error rates can be calculated. With the QoS monitor attached to the
XTP implementation, error control packets are observed. Calculation of parameters relevant to
traditional reliability is initiated by the reception of an error control packet. However, this does
not allow for a distinction between bit errors and packet losses. For that, an extension to the
XTP protocol would be needed. In order to calculate the bit error rate, the following formula
is applied to calculate the number of errors:

\[
\mathit{errors} = 8 \cdot \Bigl( \mathit{seq} - \mathit{rseq} - \sum_{i=1}^{\mathit{nspans}} \mathit{length}(\mathit{span}_i) \Bigr)
\]
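Reading the formula as eight times the number of bytes between rseq and seq that are not covered by any reported span, the computation can be sketched as below. It assumes that seq and rseq count bytes and that each span is a contiguous byte range received beyond rseq; the type `Span` and the function `count_error_bits` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span: bytes [start, end) that were received beyond rseq. */
typedef struct {
    uint64_t start;
    uint64_t end;
} Span;

/* Number of bits counted as erroneous:
 * 8 * (seq - rseq - sum of span lengths).
 * Assumes seq >= rseq and spans lying within [rseq, seq). */
uint64_t count_error_bits(uint64_t seq, uint64_t rseq,
                          const Span *spans, size_t nspans)
{
    uint64_t received = 0;

    for (size_t i = 0; i < nspans; i++)
        received += spans[i].end - spans[i].start;
    return 8 * (seq - rseq - received);
}
```

For seq = 10000, rseq = 4000 and two spans of 1000 bytes each, 4000 bytes are unaccounted for, which gives 32000 bits.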

The discussion of monitored QoS parameters in this section clearly shows that there is a tight
dependency on the monitored protocol. Some parameters may simply not be supported by
some protocols. Moreover, the calculation of some QoS parameters depends on the protocol
parameters used in the corresponding protocol mechanisms. In future work, we aim at
defining sets of meta-parameters that describe certain protocol mechanisms. They should
cover the diversity of parameter usage within the protocols. In order to adapt the QoS monitor
to a specific protocol, the corresponding set of meta-parameters is selected. This will increase
the flexibility of the monitor.

4.3 Evaluation

In summary, the QoS monitor has several features for monitoring and displaying various QoS
parameters. The overhead associated with each of them has to be evaluated carefully. The
graphical interface in particular contributes to a high overhead. However, as argued above, it is
not absolutely required in all cases. The computation of monitored QoS parameters also
contributes some overhead, especially since it is currently implemented in software. Our QoS
monitor consumes about 10-15% of the resulting performance. The transfer of a 1 MByte file
(transferred in 20 s) generated slightly more than 5000 events that were forwarded to the FIFO
queue, i.e., about 250 events per second. The rate of events was still slow enough to cause a
context switch for every event. This appears to be acceptable. In order to further decrease this
overhead, the implementation is currently being optimized with specific focus on the
communication between protocol and QoS monitor. As an alternative, the QoS monitor could be
implemented using specific hardware. This would drastically decrease the overhead on the
protocol performance.



5 SUMMARY AND FUTURE WORK

Networked multimedia applications impose high requirements on communication systems. 
They need some guaranteed QoS provided at the user interface. Therefore, in addition to the 
network itself, end systems and their internal behaviour need to be addressed as well. Gener- 
ally, an integrated QoS management is needed. Within this paper a framework for QoS man- 
agement is presented. The paper discusses some general issues related to QoS management. 
Furthermore, it presents experiences collected with an implementation of a QoS monitor.

Future work will concentrate on the integration of appropriate operating system support 
and on high performance implementation issues. Moreover, dedicated hardware support for 
selected protocol functions and for the QoS monitor is planned. A complete implementation of 
the QoS management stack is currently under way. 



ACKNOWLEDGEMENT 

First, I would like to thank Ahmed Tantawy for deep and very fruitful discussions on
quality of service and its support within communication systems. He contributed substantially
to the content of this paper. Thomas Kanschik implemented a first version of the QoS monitor.
Urs Thurmann is currently pursuing this work towards a better-performing implementation.



REFERENCES 

Braden, R., Clark, D., Shenker, S. (1994); Integrated Services in the Internet Architecture: an
Overview; RFC 1633, June 1994

Braun, T., Zitterbart, M. (1992); Parallel Transport System Design; 4th IFIP Conference on
High Performance Networking, Liège, Belgium, December 1992







Campbell, A., Aurrecoechea, A., Hauw, L. (1996); Architectural Perspectives on QoS Man- 
agement in Distributed Multimedia Systems; Workshop on Quality of Service, Paris, 
France, March 1996 

Clark, D., Tennenhouse, D. (1990); Architectural Considerations for a New Generation of 
Protocols, ACM SIGCOMM, 1990 

ITU-T I.356 (1993); B-ISDN ATM Layer Cell Transfer Performance; November 1993

Kanschik, T. (1996); QoS monitor; Diploma thesis (in German), TU Braunschweig, 1996

Kumar, V. (1995); MBone - Interactive Multimedia on the Internet; New Riders, 1995

Kurose, J.F. (1992); Open Issues and Challenges in Providing Quality of Service Guarantees in
High-Speed Networks; ACM Computer Communication Review, 1992

Nahrstedt, K., Steinmetz, R. (1995); Resource Management in Networked Multimedia Systems;
IEEE Computer, May 1995

Nahrstedt, K., Smith, J. (1995); The QoS Broker; IEEE Multimedia, 1995

Quang, N.H., Bernard, G., Belaid, D. (1995); System Support for Distributed Multimedia
Applications with Guaranteed Quality of Service; High Performance Networking VI,
Palma de Mallorca, Spain, Chapman & Hall, 1995
Ramanathan, S., Rangan, P.V., Vin, H., Kumar S.S. (1995); Enforcing application-level QoS 
by frame-induced packet discarding in video communications; Computer Communica- 
tions, Vol. 18, No. 10, October 1995 

Schmidt, C., Zitterbart, M. (1995); Towards Integrated QoS Management; LAN/MAN Work- 
shop, Florida, 1995 

Stiller, B. (1995); A Survey of UNI Signaling Systems and Protocols for ATM Networks;
ACM Computer Communication Review, April 1995

T120 (1994); ITU-T; Proposed Draft of T.120: Data Protocols for Multimedia Conferencing;
March 1994

Vogel, A., Kerhervé, B., von Bochmann, G., Gecsei, J. (1995); Distributed Multimedia and
QoS: A Survey; IEEE Multimedia, Summer 1995

Vogt, C. (1995); Quality-of-service management for multimedia streams with fixed arrival
periods and variable frame sizes; IEEE Multimedia, 1995

XTP (1996); http://www.ca.sandia.gov/xtp/SandiaXTP/

Zitterbart, M., Stiller, B., Tantawy, A.N. (1993); A Model for Flexible High-Performance
Communication Subsystems; IEEE JSAC, 1993



BIOGRAPHY 

Martina Zitterbart received her doctoral degree in computer science in 1990 from the
University of Karlsruhe, Germany. From 1987 to 1995 she was a research assistant at the
University of Karlsruhe. In 1991 and 1992 she was on leave of absence at the T.J. Watson
Research Laboratory, New York, USA. Since 1995 she has been a full professor at the
Technical University of Braunschweig. Her research interests are in the area of multimedia
systems, protocols, mobile communication and CSCW applications.




INDEX OF CONTRIBUTORS 



Ahlgren, B. 167 

Baldi, M. 29
Bianco, A. 29 
Biersack, E. 134 
Bjorkman, M. 167 
Bonaventure, O. 60 
Boutaba, R. 49 
Braun, R. 182 

Campbell, A. 201 
Campbell, R.H. 77 
Casetti, C. 13 
Chassot, C. 91 
Coulson, G. 201 

Danthine, A. 60 
Davoli, F. 3 
Diaz, M. 91 
Diot, C. 182 

Fdida, S. 121
Fournier, M. 91 



Gunningberg, P. 167

Huitema, C. 109 
Hutchison, D. 121

Khan, O. 3
Klovning, E. 60 
Kurose, J. 13 
Kushida, T. 149

Liao, W.S. 77
Lo Cigno, R. 29 
Lozes, A. 91 

Marsan, M.A. 29 
Maryni, P. 3 
Mauthe, A. 121 
Mehaoua, A. 49 
Munafo, M. 29 

Nonnenmacher, J. 134 

Pujolle, G. 49 



de Rezende, J.F. 121 

Sano, T. 149 
Shiroshita, T. 149 

Takahashi, O. 149 
Tan, S.-M. 77 
Towsley, D. 13 

Yamanouchi, N. 149 
Yamashita, M. 149 

Zitterbart, M. 219 





KEYWORD INDEX 



Adaptive algorithm 13 
Admission control 13 
Asynchronous transfer mode (ATM) 29, 49, 60, 77
  networks 29

Automated code generation 182 

End-to-end control 29 
Error control 121 
Error correction 109 

Forward error correction 109, 134 

Group communication 121
High-performance communication 182

ILP 167 

Implementation 91 
Integrated layer processing 167 
Integrity conditions 121 
IP over ATM 29

Large-scale delivery 149 
Load estimation 3 

Measurements 60 
Monitoring 219 
MPEG coding 49 
Multicast 109
  transport protocols 121
Multimedia 3
  application 91
  transport architecture 91

Network Interface Framework (NIF) 77 

Partial order connection 91 
Performance 167
  analysis 149
  evaluation 29
Protocol
  architecture 182
  implementation 167, 182

Quality of Service (QoS) 13, 49, 109, 219
  architecture 201
  management 219

Rate-based flow control 149 
Reliable multicast 134
  protocol 149
  transport 109
Reliability 121 
Resource exchanger 77 

Scalability 134 
Self-similarity 3 
Simulation 13 
STREAM concept 91 








TCP 60 

TCP-XTP file transfers 29 
Topology 134 
Traffic contract 60 
Transport
  protocol 29, 91, 109
  service 91



Usage parameter control (UPC) 60 

Variable bit rate (VBR) 60 
VBR video 49 

x-kernel 77 




